Abstract



In recent years analysis of complexity of learning Gaussian mixture models from sampled data 
has received significant attention in computational machine learning and theory communities. In 
this paper we present the first result showing that polynomial time learning of multidimensional 
Gaussian Mixture distributions is possible when the separation between the component means is 
arbitrarily small. Specifically, we present an algorithm for learning the parameters of a mixture of k 
identical spherical Gaussians in n-dimensional space with an arbitrarily small separation between 
the components, which is polynomial in dimension, inverse component separation and other input 
parameters for a fixed number of components k. The algorithm uses a projection to k dimensions 
and then a reduction to the 1-dimensional case. It relies on a theoretical analysis showing that 
two 1-dimensional mixtures whose densities are close in the $L^2$ norm must have similar means
and mixing coefficients. To produce the necessary lower bound for the $L^2$ norm in terms of the
distances between the corresponding means, we analyze the behavior of the Fourier transform of a 
mixture of Gaussians in one dimension around the origin, which turns out to be closely related to 
the properties of the Vandermonde matrix obtained from the component means. Analysis of minors 
of the Vandermonde matrix together with basic function approximation results allows us to provide 
a lower bound for the norm of the mixture in the Fourier domain and hence a bound in the original 
space. Additionally, we present a separate argument for reconstructing variance. 



1 Introduction 

Mixture models, particularly Gaussian mixture models, are a widely used tool for many problems of statistical 
inference ||2T1 [T9l [181 [TT1 [TTl . The basic problem is to estimate the parameters of a mixture distribution, 
such as the mixing coefficients, means and variances within some pre-specified precision from a number 
of sampled data points. While the history of Gaussian mixture models goes back to [20], in recent years
the theoretical aspects of mixture learning have attracted considerable attention in theoretical computer
science, starting with the pioneering work of [9], which showed that a mixture of k spherical Gaussians in
n dimensions can be learned in time polynomial in n, provided certain separation conditions between the
component means (separation of order $\sqrt{n}$) are satisfied. This work has been refined and extended in a
number of recent papers. The first result from [9] was later improved to the order of $\Omega(n^{1/4})$ in [10] for
spherical Gaussians and in [2] for general Gaussians. The separation requirement was further reduced and
made independent of n to the order of $\Omega(k^{1/4})$ in [23] for spherical Gaussians and to the order of $\Omega(k^{3/2}/\epsilon^2)$
in [15] for log-concave distributions. In a related work [1] the separation requirement was reduced to
$\Omega(k + \sqrt{k \log n})$. An extension of PCA called isotropic PCA was introduced in [3] to learn mixtures of Gaussians
when any pair of Gaussian components is separated by a hyperplane having very small overlap along the
hyperplane direction (so-called "pancake layering problem").

In a slightly different direction the recent work [13] made an important contribution to the subject by
providing a polynomial time algorithm for PAC-style learning of mixtures of Gaussian distributions with
arbitrary separation between the means. The authors used a grid search over the space of parameters to
construct a hypothesis mixture of Gaussians that has density close to the actual mixture generating the
data. We note that the problem analyzed in [13] can be viewed as density estimation within a certain family
of distributions and is different from most other work on the subject, including our paper, which addresses
parameter learning.¹

We also note several recent papers dealing with the related problems of learning mixtures of product
distributions and heavy tailed distributions. See for example, [12, 8, 5, 6].

In the statistics literature, [7] showed that the optimal convergence rate of the MLE estimator for a finite mixture
of normal distributions is $O(n^{-1/2})$, where n is the sample size, if the number of mixing components k is known in
advance and is $O(n^{-1/4})$ when the number of mixing components is known up to an upper bound. However,
this result does not address the computational aspects, especially in high dimension.

In this paper we develop a polynomial time (for a fixed k) algorithm to identify the parameters of the
mixture of k identical spherical Gaussians with potentially unknown variance for an arbitrarily small
separation between the components². To the best of our knowledge this is the first result of this kind except for
the simultaneous and independent work [14], which analyzes the case of a mixture of two Gaussians with
arbitrary covariance matrices using the method of moments. We note that the results in [14] and in our paper
are somewhat orthogonal. Each paper deals with a special case of the ultimate goal (two arbitrary Gaussians
in [14] and k identical spherical Gaussians with unknown variance in our case), which is to show polynomial
learnability for a mixture with an arbitrary number of components and arbitrary variance.

All other existing algorithms for parameter estimation require the minimum separation between the
components to be an increasing function of at least one of n or k. Our result also implies a density estimate bound
along the lines of [13]. We note, however, that we do have to pay a price as our procedure (similarly to that
in [13]) is super-exponential in k. Despite these limitations we believe that our paper makes a step towards
understanding the fundamental problem of polynomial learnability of Gaussian mixture distributions. We
also think that the technique used in the paper to obtain the lower bound may be of independent interest.

The main algorithm in our paper involves a grid search over a certain space of parameters, specifically 
means and mixing coefficients of the mixture (a completely separate argument is given to estimate the vari- 
ance). By giving appropriate lower and upper bounds for the norm of the difference of two mixture distri- 
butions in terms of their means, we show that such a grid search is guaranteed to find a mixture with nearly 
correct values of the parameters. 

To prove that, we need to provide lower and upper bounds on the norm of the mixture. A key point
of our paper is the lower bound showing that two mixtures with different means cannot produce similar
density functions. This bound is obtained by reducing the problem to a 1-dimensional mixture distribution
and analyzing the behavior of the Fourier transform (closely related to the characteristic function, whose
coefficients are moments of a random variable up to multiplication by a power of the imaginary unit i) of
the difference between densities near zero. We use certain properties of minors of Vandermonde matrices
to show that the norm of the mixture in the Fourier domain is bounded from below. Since the $L^2$ norm is
invariant under the Fourier transform this provides a lower bound on the norm of the mixture in the original
space.

We also note the work [16], where Vandermonde matrices appear in the analysis of mixture distributions
in the context of proving consistency of the method of moments (in fact, we rely on a result from [16] to
provide an estimate for the variance).

Finally, our lower bound, together with an upper bound and some results from the non-parametric density 
estimation and spectral projections of mixture distributions allows us to set up a grid search algorithm over 
the space of parameters with the desired guarantees. 

2 Outline of the argument 

In this section we provide an informal outline of the argument that leads to the main result. To simplify the
discussion, we will assume that the variance for the components is known or estimated by using the estimation
algorithm provided in Section 3.3. It is straightforward (but requires a lot of technical details) to see that all
results go through if the actual variance is replaced by a sufficiently (polynomially) accurate estimate.

We will denote the n-dimensional Gaussian density $\frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{\|x-\mu\|^2}{2\sigma^2}\right)$ by $K(x, \mu)$, where
$x, \mu \in \mathbb{R}^n$ or, when appropriate, in $\mathbb{R}^k$. The notation $\|\cdot\|$ will always be used to represent the $L^2$ norm while $d_H(\cdot, \cdot)$
will be used to denote the Hausdorff distance between sets of points. Let $p(x) = \sum_{i=1}^k \alpha_i K(x, \mu_i)$ be a
mixture of k Gaussian components with the covariance matrix $\sigma^2 I$ in $\mathbb{R}^n$. The goal will be to identify the
means $\mu_i$ and the mixing coefficients $\alpha_i$ under the assumption that the minimum distance $\|\mu_i - \mu_j\|$, $i \neq j$,
is bounded from below by some given (arbitrarily small) $d_{\min}$ and the minimum mixing weight is bounded
1 Note that density estimation is generally easier than parameter learning since quite different configurations of
parameters could conceivably lead to very similar density functions, while similar configurations of parameters always result in
similar density functions.

2 We point out that some non-zero separation is necessary since the problem of learning parameters without any 
separation assumptions at all is ill-defined. 



from below by $\alpha_{\min}$. We note that while $\sigma$ can also be estimated, we will assume that it is known in advance
to simplify the arguments. The number of components needs to be known in advance, which is in line with
other work on the subject. Our main result is an algorithm guaranteed to produce an approximating mixture
$\tilde{p}$, whose means and mixing coefficients are all within $\epsilon$ of their true values and whose running time is
polynomial in all parameters other than k. Input to our algorithm is $\alpha_{\min}$, $\sigma$, k, N points in $\mathbb{R}^n$ sampled from
p and an arbitrarily small positive $\epsilon$ satisfying $\epsilon < d_{\min}/2$. The algorithm has the following main steps.

Parameters: $\alpha_{\min}$, $d_{\min}$, $\sigma$, k.

Input: $\epsilon < d_{\min}/2$, N points in $\mathbb{R}^n$ sampled from p.

Output: $\theta^*$, the vector of approximated means and mixing coefficients.

Step 1. (Reduction to k dimensions). Given a polynomial number of data points sampled from p it is
possible to identify the k-dimensional span of the means $\mu_i$ in $\mathbb{R}^n$ by using Singular Value Decomposition
(see [23]). By an additional argument the problem can be reduced to analyzing a mixture of k Gaussians in
$\mathbb{R}^k$.

Step 2. (Construction of kernel density estimator). Using Step 1, we can assume that n = k. Given a
sample of N points in $\mathbb{R}^k$, we construct a density function $p_{kde}$ using an appropriately chosen kernel density
estimator. Given sufficiently many points, $\|p - p_{kde}\|$ can be made arbitrarily small. Note that while $p_{kde}$ is
a mixture of Gaussians, it is not a mixture of k Gaussians.

Step 3. (Grid search). Let $\Theta = (\mathbb{R}^k)^k \times \mathbb{R}^k$ be the $(k^2 + k)$-dimensional space of parameters (component
means and mixing coefficients) to be estimated. Because of Step 1, we can assume (see Lemma 1) that the $\mu_i$s are in
$\mathbb{R}^k$.

For any $\tilde\theta = (\tilde\mu_1, \tilde\mu_2, \ldots, \tilde\mu_k, \tilde\alpha) = (\tilde{m}, \tilde\alpha) \in \Theta$, let $p(x, \tilde\theta)$ be the corresponding mixture distribution.
Note that $\theta = (m, \alpha) \in \Theta$ are the true parameters. We obtain a value G (polynomial in all arguments for
a fixed k) from Theorem 4 and take a grid $M_G$ of size G in $\Theta$. The value $\theta^*$ is found from a grid search
according to the following equation

\[ \theta^* = \operatorname{argmin}_{\tilde\theta \in M_G} \left\{ \|p(x, \tilde\theta) - p_{kde}\| \right\} \qquad (1) \]

We show that the means and mixing coefficients obtained by taking $\theta^*$ are close to the true underlying
means and mixing coefficients of p with high probability. We note that our algorithm is deterministic and the
uncertainty comes only from the sample (through the SVD projection and density estimation).
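The reduction in Step 1 can be illustrated numerically. The following is only a minimal sketch under stated assumptions, not the procedure analyzed in the paper: the helper names, the dimensions, the sample size and the seed are all hypothetical choices made for illustration. It samples from a spherical mixture in $\mathbb{R}^n$, takes the top k right singular vectors of the (uncentered) data matrix, and measures how far the true means lie from their span.

```python
import numpy as np


def sample_spherical_mixture(means, weights, sigma, n_samples, rng):
    """Draw n_samples points from a mixture of spherical Gaussians (covariance sigma^2 I)."""
    labels = rng.choice(len(weights), size=n_samples, p=weights)
    noise = sigma * rng.standard_normal((n_samples, means.shape[1]))
    return means[labels] + noise


def top_k_right_singular_vectors(data, k):
    """Return a (k, n) array whose rows are the top-k right singular vectors of data."""
    _, _, vt = np.linalg.svd(data, full_matrices=False)
    return vt[:k]


def residual_outside_span(points, basis):
    """Distance of each row of points from the span of the orthonormal rows of basis."""
    projected = (points @ basis.T) @ basis
    return np.linalg.norm(points - projected, axis=1)


# Hypothetical example: k = 2 components in n = 20 dimensions, means inside [-1, 1]^n.
rng = np.random.default_rng(0)
n, k, sigma = 20, 2, 0.5
means = np.zeros((k, n))
means[0, 0] = 1.0
means[1, 1] = -1.0
weights = np.array([0.4, 0.6])

data = sample_spherical_mixture(means, weights, sigma, 50_000, rng)
basis = top_k_right_singular_vectors(data, k)

# How far the true means are from the estimated k-dimensional subspace.
residuals = residual_outside_span(means, basis)
# k-dimensional coordinates of the projected means along the singular vectors.
coords = means @ basis.T
```

With enough samples the residuals should be small, and the pairwise distances between the rows of `coords` should be close to the distances between the original means, which is the property the reduction to $\mathbb{R}^k$ relies on.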

While a somewhat different grid search algorithm was used in [13], the main novelty of our result is
showing that the parameters estimated from the grid search are close to the true underlying parameters of
the mixture. In principle, it is conceivable that two different configurations of Gaussians could give rise to
very similar mixture distributions. However, we show that this is not the case. Specifically, and this is the
theoretical core of this paper, we show that mixtures with different means/mixing coefficients cannot be close
in $L^2$ norm³ (Theorem 2) and thus the grid search yields parameter values $\theta^*$ that are close to the true values
of the means and mixing coefficients.
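A one-dimensional toy version of the grid search may help fix ideas. This is a simplified sketch and not the algorithm of the paper: it assumes the mixing weights are known and equal, it searches over the means only, and it compares candidates against the true density rather than a kernel density estimate, so that the result is deterministic. The function names and the numerical values are hypothetical. The squared $L^2$ distance is computed in closed form from the identity $\int K(x,\mu)K(x,\nu)\,dx = (4\pi\sigma^2)^{-1/2}\exp(-(\mu-\nu)^2/(4\sigma^2))$ for 1-dimensional Gaussians with common variance $\sigma^2$.

```python
import itertools

import numpy as np


def gaussian_overlap(mu, nu, sigma):
    """Integral of K(x, mu) * K(x, nu) dx for 1-D Gaussians with variance sigma^2."""
    return np.exp(-((mu - nu) ** 2) / (4 * sigma**2)) / np.sqrt(4 * np.pi * sigma**2)


def l2_distance_sq(means_a, weights_a, means_b, weights_b, sigma):
    """Squared L2 distance between two 1-D mixtures that share the variance sigma^2."""
    means = np.concatenate([means_a, means_b])
    signed = np.concatenate([weights_a, -np.asarray(weights_b)])
    gram = gaussian_overlap(means[:, None], means[None, :], sigma)
    return float(signed @ gram @ signed)


def grid_search_means(target_means, target_weights, sigma, grid, k):
    """Return the k grid means whose mixture is closest in L2 to the target mixture."""
    best, best_dist = None, np.inf
    for candidate in itertools.combinations(grid, k):  # candidates in increasing order
        cand = np.array(candidate)
        dist = l2_distance_sq(target_means, target_weights, cand, target_weights, sigma)
        if dist < best_dist:
            best, best_dist = cand, dist
    return best, best_dist


# Hypothetical example: separation 0.5 is small compared with sigma = 1.
sigma = 1.0
true_means = np.array([-0.3, 0.2])
true_weights = np.array([0.5, 0.5])
grid = np.linspace(-1.0, 1.0, 21)  # grid step 0.1

best, best_dist = grid_search_means(true_means, true_weights, sigma, grid, k=2)
```

Even though the two components overlap heavily, other grid points give a strictly larger $L^2$ distance, which is the qualitative behavior that the lower bound of Theorem 2 quantifies.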

To provide a better high-level overview of the whole proof we give a high level summary of the argument 
(Steps 2 and 3). 

1. Since we do not know the underlying probability distribution p directly, we construct $p_{kde}$, which is a
proxy for $p = p(x, \theta)$. $p_{kde}$ is obtained by taking an appropriate non-parametric density estimate and,
given a sufficiently large polynomial sample, can be made to be arbitrarily close to p in $L^2$ norm (see
Lemma 17). Thus the problem of approximating p in $L^2$ norm can be replaced by approximating $p_{kde}$.

2. The main technical part of the paper are the lower and upper bounds on the norm $\|p(x, \theta) - p(x, \tilde\theta)\|$ in
terms of the Hausdorff distance between the component means (considered as sets of k points) $m$ and
$\tilde{m}$. Specifically, in Theorem 2 and Lemma 3 we prove that for $\tilde\theta = (\tilde{m}, \tilde\alpha)$

\[ d_H(m, \tilde{m}) < f(\|p(x, \theta) - p(x, \tilde\theta)\|) < h(d_H(m, \tilde{m}) + \|\alpha - \tilde\alpha\|_1) \]

where $f$, $h$ are some explicitly given increasing functions. The lower bound shows that $d_H(m, \tilde{m})$ can
be controlled by making $\|p(x, \theta) - p(x, \tilde\theta)\|$ sufficiently small, which (assuming minimum separation
$d_{\min}$ between the components of p) immediately implies that each component mean of $m$ is close to
exactly one component mean of $\tilde{m}$.

On the other hand, the upper bound guarantees that a search over a sufficiently fine grid in the space
$\Theta$ will produce a value $\theta^*$, s.t. $\|p(x, \theta) - p(x, \theta^*)\|$ is small.

3 Note that our notion of distance between two density functions is slightly different from the standard ones used in
the literature, e.g., Hellinger distance or KL divergence. However, our goal is to estimate the parameters and here we use the $L^2$
norm merely as a tool to describe that two distributions are different.



3. Once the component means $m$ and $\tilde{m}$ are shown to be close, an argument using the Lipschitz property
of the mixture with respect to the mean locations can be used to establish that the corresponding mixing
coefficients are also close (Corollary 5).
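The Hausdorff distance $d_H$ between two finite sets of means, used throughout the summary above, can be computed directly from its definition. The sketch below is illustrative only; the function name and the sample points are hypothetical.

```python
import numpy as np


def hausdorff_distance(set_a, set_b):
    """Hausdorff distance between two finite point sets given as (num_points, dim) arrays."""
    a = np.asarray(set_a, dtype=float)
    b = np.asarray(set_b, dtype=float)
    # pairwise[i, j] is the Euclidean distance between a[i] and b[j].
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # Largest distance from a point in one set to its nearest neighbor in the other set.
    return float(max(pairwise.min(axis=1).max(), pairwise.min(axis=0).max()))


true_means = [[0.0, 0.0], [1.0, 0.0]]
close_estimates = [[0.0, 0.1], [1.2, 0.0]]
collapsed_estimates = [[0.0, 0.0], [0.0, 0.0]]

d_close = hausdorff_distance(true_means, close_estimates)
d_collapsed = hausdorff_distance(true_means, collapsed_estimates)
```

In the second case both estimates sit on the same true mean, so the other true mean has no nearby estimate and the distance is large. This is why a small $d_H(m, \tilde{m})$, together with the minimum separation between true means, pairs every true mean with exactly one estimate.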

We will now briefly outline the argument for the main theoretical contribution of this paper, which is a lower
bound on the $L^2$ norm in terms of the Hausdorff distance (Theorem 2).

1. (Minimum distance, reduction from $\mathbb{R}^k$ to $\mathbb{R}^1$) Suppose a component mean $\mu_i$ is separated from every
estimated mean $\tilde\mu_j$ by a distance of at least d. Then there exists a unit vector v in $\mathbb{R}^k$ such that
$\forall j$, $|\langle v, (\mu_i - \tilde\mu_j)\rangle| \geq \frac{d}{4k^2}$. In other words a certain amount of separation is preserved after an appropriate
projection to one dimension. See Lemma 13 for a proof.

2. (Norm estimation, reduction from $\mathbb{R}^k$ to $\mathbb{R}^1$). Let $p$ and $\tilde{p}$ be the true and estimated density respectively
and let v be a unit vector in $\mathbb{R}^k$. $p_v$ and $\tilde{p}_v$ will denote the one-dimensional marginal densities obtained
by integrating $p$ and $\tilde{p}$ in the directions orthogonal to v. It is easy to see that $p_v$ and $\tilde{p}_v$ are mixtures
of 1-dimensional Gaussians, whose means are projections of the original means onto v. It is shown in
Lemma 14 that

\[ \|p - \tilde{p}\|^2 \geq \left(\frac{1}{c\sigma}\right)^k \|p_v - \tilde{p}_v\|^2 \]

and thus to provide a lower bound for $\|p - \tilde{p}\|$ it is sufficient to provide an analogous bound (with a
different separation between the means) in one dimension.

3. (1-d lower bound) Finally, we consider a mixture q of 2k Gaussians in one dimension, with the
assumption that one of the component means is separated from the rest of the component means by at least t
and that the (not necessarily positive) mixing weights exceed $\alpha_{\min}$ in absolute value. Assuming that the
means lie in an interval $[-a, a]$ we show (Theorem 6) that $\|q\|^2$ is bounded from below by a quantity
that depends polynomially on $\alpha_{\min}$ and t, with the dependence on t of the form $t^{Ck^2}$
for some positive constant C independent of k.
The proof of this result relies on analyzing the Taylor series for the Fourier transform of q near zero,
which turns out to be closely related to a certain Vandermonde matrix.

Combining 1 and 2 above and applying the result in 3 to $q = p_v - \tilde{p}_v$ yields the desired lower bound for $\|p - \tilde{p}\|$.

3 Main Results 

In this section we present our main results. First we show that we can reduce the problem in $\mathbb{R}^n$ to a
corresponding problem in $\mathbb{R}^k$, where n represents the dimension and k is the number of components, at the cost
of an arbitrarily small error. Then we solve the reduced problem in $\mathbb{R}^k$, again allowing for only an arbitrarily
small error, by establishing appropriate lower and upper bounds of a mixture norm in $\mathbb{R}^k$.

Lemma 1 (Reduction from $\mathbb{R}^n$ to $\mathbb{R}^k$) Consider a mixture of k n-dimensional spherical Gaussians $p(x) =
\sum_{i=1}^k \alpha_i K(x, \mu_i)$ where the means lie within a cube $[-1,1]^n$, $\|\mu_i - \mu_j\| > d_{\min} > 0$, $\forall i \neq j$ and for all
i, $\alpha_i > \alpha_{\min}$. For any positive $\epsilon < \frac{d_{\min}}{2}$ and $\delta \in (0, 1)$, given a sample of size $\mathrm{poly}\left(\frac{n}{\epsilon\alpha_{\min}}\right) \cdot \log\left(\frac{1}{\delta}\right)$,
with probability greater than $1 - \delta$, the problem of learning the parameters (means and mixing weights) of p
within $\epsilon$ error can be reduced to learning the parameters of a k-dimensional mixture of spherical Gaussians
$p_o(x) = \sum_{i=1}^k \alpha_i K(x, \nu_i)$ where the means lie within a cube $\left[-\sqrt{n/k}, \sqrt{n/k}\right]^k$, $\|\nu_i - \nu_j\| > \frac{d_{\min}}{2} > 0$, $\forall i \neq j$.
However, in $\mathbb{R}^k$ we need to learn the means within $\frac{\epsilon}{2}$ error.

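Proof: For i = 1, . . . , k, let $v_i \in \mathbb{R}^n$ be the top k right singular vectors of a data matrix of size
$\mathrm{poly}\left(\frac{n}{\epsilon\alpha_{\min}}\right) \cdot \log\left(\frac{1}{\delta}\right)$ sampled from p(x). It is well known (see [23]) that the space spanned by the means $\{\mu_i\}_{i=1}^k$ remains
arbitrarily close to the space spanned by $\{v_i\}_{i=1}^k$. In particular, with probability greater than $1 - \delta$, the
projected means $\{\tilde\mu_i\}_{i=1}^k$ satisfy $\|\mu_i - \tilde\mu_i\| \leq \frac{\epsilon}{2}$ for all i (see Lemma 15).

Note that each projected mean $\tilde\mu_i \in \mathbb{R}^n$ can be represented by a k-dimensional vector $\nu_i$ whose entries are
the coefficients along the singular vectors $v_j$s, that is, for all i, $\tilde\mu_i = \sum_{j=1}^k \nu_{ij} v_j$. Thus, for any $i \neq j$,
$\|\tilde\mu_i - \tilde\mu_j\| = \|\nu_i - \nu_j\|$. Since $\|\tilde\mu_i - \tilde\mu_j\| \geq d_{\min} - \frac{\epsilon}{2} - \frac{\epsilon}{2} = d_{\min} - \epsilon > d_{\min} - \frac{d_{\min}}{2} = \frac{d_{\min}}{2}$, we have
$\|\nu_i - \nu_j\| > \frac{d_{\min}}{2}$. Also note that each $\nu_i$ lies within a cube $\left[-\sqrt{n/k}, \sqrt{n/k}\right]^k$, where the axes of the cube are
along the top k singular vectors $v_j$s.

Now suppose we can estimate each $\nu_i$ by $\tilde\nu_i \in \mathbb{R}^k$ such that $\|\nu_i - \tilde\nu_i\| \leq \frac{\epsilon}{2}$. Again each $\tilde\nu_i$ has a
corresponding representation $\hat\mu_i \in \mathbb{R}^n$ such that $\hat\mu_i = \sum_{j=1}^k \tilde\nu_{ij} v_j$ and $\|\tilde\mu_i - \hat\mu_i\| = \|\nu_i - \tilde\nu_i\|$. This implies that
for each i, $\|\mu_i - \hat\mu_i\| \leq \|\mu_i - \tilde\mu_i\| + \|\tilde\mu_i - \hat\mu_i\| \leq \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon$. ■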

From here onwards we will deal with mixtures of Gaussians in $\mathbb{R}^k$. Thus we will assume that $p_o$ denotes
the true mixture with means $\{\nu_i\}_{i=1}^k$ while $\tilde{p}_o$ represents any other mixture in $\mathbb{R}^k$ with different means and
mixing weights.

We first prove a lower bound for $\|p_o - \tilde{p}_o\|$.

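Theorem 2 (Lower bound in $\mathbb{R}^k$) Consider a mixture of k k-dimensional spherical Gaussians $p_o(x) =
\sum_{i=1}^k \alpha_i K(x, \nu_i)$ where the means lie within a cube $\left[-\sqrt{n/k}, \sqrt{n/k}\right]^k$, $\|\nu_i - \nu_j\| > \frac{d_{\min}}{2} > 0$, $\forall i \neq j$ and
for all i, $\alpha_i > \alpha_{\min}$. Let $\tilde{p}_o(x) = \sum_{i=1}^k \tilde\alpha_i K(x, \tilde\nu_i)$ be some arbitrary mixture such that the Hausdorff
distance between the set of true means $m$ and the estimated means $\tilde{m}$ satisfies $d_H(m, \tilde{m}) \leq \frac{d_{\min}}{4}$. Then

\[ \|p_o - \tilde{p}_o\|^2 \geq \left(\frac{\alpha_{\min}^4}{c\sigma}\right)^k \left(\frac{d_H(m, \tilde{m})}{4nk^2}\right)^{Ck^2} \]

where C, c are some positive constants independent of n, k.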

Proof: Consider any arbitrary $\nu_i$ such that its closest estimate $\tilde\nu_i$ from $\tilde{m}$ is at distance $t = \|\nu_i - \tilde\nu_i\|$. Note that
$t \leq \frac{d_{\min}}{4}$ and all other $\tilde\nu_j$, $j \neq i$ are at a distance at least t from $\nu_i$. Lemma 13 ensures the existence
of a direction $v \in \mathbb{R}^k$ such that, upon projecting on it, $|\langle v, (\nu_i - \tilde\nu_i)\rangle| \geq \frac{t}{4k^2}$ and all other projected
means $\langle v, \nu_j\rangle$, $\langle v, \tilde\nu_j\rangle$, $j \neq i$ are at a distance at least $\frac{t}{4k^2}$ from $\langle v, \nu_i\rangle$. Note that after projecting on v,
the mixture becomes a mixture of 1-dimensional Gaussians with variance $\sigma^2$ and whose projected means
lie within $[-\sqrt{n}, \sqrt{n}]$. Let us denote these 1-dimensional mixtures by $p_v$ and $\tilde{p}_v$ respectively. Then using
Theorem 6, $\|p_v - \tilde{p}_v\|^2 \geq \alpha_{\min}^{4k} \left(\frac{t}{4nk^2}\right)^{Ck^2}$. Note that we obtain $p_v$ (respectively $\tilde{p}_v$) by integrating $p_o$
(respectively $\tilde{p}_o$) in all $(k - 1)$ orthogonal directions to v. Now we need to relate $\|p_o - \tilde{p}_o\|$ and $\|p_v - \tilde{p}_v\|$.
This is done in Lemma 14 to ensure that $\|p_o - \tilde{p}_o\|^2 \geq \left(\frac{1}{c\sigma}\right)^k \|p_v - \tilde{p}_v\|^2$ where $c > 0$ is chosen in such a
way that in any arbitrary direction the probability mass of each projected Gaussian on that direction becomes
negligible outside the interval $[-c\sigma/2, c\sigma/2]$. Thus, $\|p_o - \tilde{p}_o\|^2 \geq \left(\frac{\alpha_{\min}^4}{c\sigma}\right)^k \left(\frac{t}{4nk^2}\right)^{Ck^2}$. Since this holds
for any arbitrary $\nu_i$, we can replace t by $d_H(m, \tilde{m})$. ■

Next, we prove a straightforward upper bound for $\|p_o - \tilde{p}_o\|$.

Lemma 3 (Upper bound in $\mathbb{R}^k$) Consider a mixture of k k-dimensional spherical Gaussians $p_o(x) =
\sum_{i=1}^k \alpha_i K(x, \nu_i)$ where the means lie within a cube $\left[-\sqrt{n/k}, \sqrt{n/k}\right]^k$, $\|\nu_i - \nu_j\| > \frac{d_{\min}}{2} > 0$, $\forall i \neq j$ and
for all i, $\alpha_i > \alpha_{\min}$. Let $\tilde{p}_o(x) = \sum_{i=1}^k \tilde\alpha_i K(x, \tilde\nu_i)$ be some arbitrary mixture such that the Hausdorff
distance between the set of true means $m$ and the estimated means $\tilde{m}$ satisfies $d_H(m, \tilde{m}) \leq \frac{d_{\min}}{4}$. Then
there exists a permutation $\pi : \{1, 2, \ldots, k\} \to \{1, 2, \ldots, k\}$ such that

\[ \|p_o - \tilde{p}_o\| \leq \frac{1}{(2\pi\sigma^2)^{k/2}} \sum_{i=1}^k \sqrt{|\alpha_i - \tilde\alpha_{\pi(i)}|^2 + \frac{d_H^2(m, \tilde{m})}{\sigma^2}} \]

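Proof: Due to the constraint on the Hausdorff distance and the constraint on the pairwise distance between
the means of $m$, there exists a permutation $\pi : \{1, 2, \ldots, k\} \to \{1, 2, \ldots, k\}$ such that $\|\nu_i - \tilde\nu_{\pi(i)}\| \leq
d_H(m, \tilde{m})$. Due to the one-to-one correspondence, without loss of generality we can write
$\|p_o - \tilde{p}_o\| \leq \sum_{i=1}^k \|g_i\|$ where $g_i(x) = \alpha_i K(x, \nu_i) - \tilde\alpha_{\pi(i)} K(x, \tilde\nu_{\pi(i)})$. Now using Lemma 16 and the inequality $1 - e^{-x} \leq x$,

\[
\begin{aligned}
\|g_i\|^2 &\leq \frac{1}{(2\pi\sigma^2)^k} \left( \alpha_i^2 + \tilde\alpha_{\pi(i)}^2 - 2\alpha_i\tilde\alpha_{\pi(i)} \exp\left(-\frac{\|\nu_i - \tilde\nu_{\pi(i)}\|^2}{2\sigma^2}\right) \right) \\
&= \frac{1}{(2\pi\sigma^2)^k} \left( (\alpha_i - \tilde\alpha_{\pi(i)})^2 + 2\alpha_i\tilde\alpha_{\pi(i)} \left(1 - \exp\left(-\frac{\|\nu_i - \tilde\nu_{\pi(i)}\|^2}{2\sigma^2}\right)\right) \right) \\
&\leq \frac{1}{(2\pi\sigma^2)^k} \left( (\alpha_i - \tilde\alpha_{\pi(i)})^2 + \frac{d_H^2(m, \tilde{m})}{\sigma^2} \right). \qquad ■
\end{aligned}
\]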

We now present our main result for learning mixtures of Gaussians with arbitrarily small separation.

Theorem 4 Consider a mixture of k n-dimensional spherical Gaussians $p(x) = \sum_{i=1}^k \alpha_i K(x, \mu_i)$ where
the means lie within a cube $[-1, 1]^n$, $\|\mu_i - \mu_j\| > d_{\min} > 0$, $\forall i \neq j$ and for all i, $\alpha_i > \alpha_{\min}$. Then given any
positive $\epsilon < \frac{d_{\min}}{2}$ and $\delta \in (0,1)$, there exists a positive $C_1$ independent of n and k such that using a sample of



I j • log (§) ) fln^ « gn'of Mg of size G = jSfi ( 8K e fc 2 ) 1 , £>wr algorithm given 

_ .3/2 / 3/2,1/2 \ Clfc 2 

fry Equation |7] rani in fime -^-f — I - — " — ) and provides mean estimates which, with probability 



K 

greater than $1 - \delta$, are within $\epsilon$ of their corresponding true values.
Proof: The proof has several parts.

SVD projection: We have shown in Lemma 1 that after projecting to SVD space (using a sample of size
$\mathrm{poly}\left(\frac{n}{\epsilon\alpha_{\min}}\right) \cdot \log\left(\frac{2}{\delta}\right)$), we need to estimate the parameters of the mixture in $\mathbb{R}^k$, $p_o(x) = \sum_{i=1}^k \alpha_i K(x, \nu_i)$,
where we must estimate the means within $\frac{\epsilon}{2}$ error.

Grid Search: Let us denote the parameters⁴ of the underlying mixture $p_o(x, \theta)$ by
$\theta = (m, \alpha) = (\nu_1, \ldots, \nu_k, \alpha) \in \mathbb{R}^{k^2+k}$, and let any approximating mixture $p_o(x, \tilde\theta)$ have parameters $\tilde\theta =
(\tilde{m}, \tilde\alpha)$. We have proved the bounds $f_1(d_H(m, \tilde{m})) \leq \|p(x, \theta) - p(x, \tilde\theta)\| \leq f_2(d_H(m, \tilde{m}) + \|\alpha - \tilde\alpha\|_1)$
(see Theorem 2 and Lemma 3), where $f_1$ and $f_2$ are increasing functions. Let G be the step/grid size (whose
value we need to set) that we use for gridding along each of the $k^2 + k$ parameters over the grid $M_G$. We note
that the $L^2$ norm of the difference can be computed efficiently by the multidimensional trapezoidal rule or any
other standard numerical analysis technique (see e.g., [4]). Since this integration needs to be performed on a
$(k^2 + k)$-dimensional space, for any pre-specified precision parameter $\epsilon$, this can be done in time $\left(\frac{1}{\epsilon}\right)^{O(k^2)}$.
Now note that there exists a point $\theta^* = (m^*, \alpha^*)$ on the grid $M_G$, such that if somehow we can identify this
point as our parameter estimate then we make an error at most $G/2$ in estimating each mixing weight and
make an error at most $G\sqrt{k}/2$ in estimating each mean. Since there are k mixing weights and k means to be

estimated, ||p o (a;,0) -p o (a:,0*)|| < f2(d H (m,m*) + \\a - a*|ji) < / 2 (G) = 2^^% G - Vmi ^ 

h (d H (m, m*)) < \\ Po (x, 9) - Po (x, 9* )|| < f 2 (G) 



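Now, according to Lemma 17, using a sufficiently large sample (of a size that depends on $\epsilon_*$ and $\log(2/\delta)$)
we can obtain a kernel density estimate such that with probability greater than $1 - \frac{\delta}{2}$,

\[ \|p_{kde} - p_o(x, \theta)\| \leq \epsilon_* \qquad (2) \]

By the triangle inequality this implies

\[ f_1(d_H(m, m^*)) - \epsilon_* \leq \|p_{kde} - p_o(x, \theta^*)\| \leq f_2(G) + \epsilon_* \qquad (3) \]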

Since there is a one-to-one correspondence between the set of means of $m$ and $m^*$, $d_H(m, m^*)$ essentially
provides the maximum estimation error for any pair of true mean and its corresponding estimate. Suppose
we choose G such that it satisfies

\[ 2\epsilon_* + f_2(G) \leq f_1\left(\frac{\epsilon}{2}\right) \qquad (4) \]

For this choice of grid size, Equation 3 and Equation 4 ensure that $f_1(d_H(m, m^*)) \leq f_2(G) + 2\epsilon_* \leq
f_1\left(\frac{\epsilon}{2}\right)$. Hence $d_H(m, m^*) \leq \frac{\epsilon}{2}$. Now consider a point $\theta^N = (m^N, \alpha^N)$ on the grid $M_G$ such that
$d_H(m, m^N) > \frac{\epsilon}{2}$. This implies

\[ f_1(d_H(m, m^N)) > f_1\left(\frac{\epsilon}{2}\right) \qquad (5) \]

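Now,

\[
\begin{aligned}
\|p_o(x, \theta^N) - p_{kde}\| &\overset{a}{\geq} \|p_o(x, \theta^N) - p_o(x, \theta)\| - \|p_o(x, \theta) - p_{kde}\| \\
&\overset{b}{\geq} f_1(d_H(m, m^N)) - \epsilon_* \\
&\overset{c}{>} f_1\left(\frac{\epsilon}{2}\right) - \epsilon_* \\
&\overset{d}{\geq} f_2(G) + \epsilon_* \\
&\overset{e}{\geq} \|p_o(x, \theta^*) - p_{kde}\|
\end{aligned}
\]

where inequality a follows from the triangle inequality, inequality b follows from Equation 2, strict inequality c
follows from Equation 5, inequality d follows from Equation 4 and finally inequality e follows from Equation
3. Setting $\epsilon_* = \frac{1}{3} f_1\left(\frac{\epsilon}{2}\right)$, Equation 4 and the above strict inequality guarantee that for a choice of grid size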

4 To make our presentation simple we assume that the single parameter variance is fixed and known. Note that it can 
also be estimated. 



G = f 2 1 (^/i (|)) = (%0z ) (s^P') C ' 1 ' C ^ e solution obtained by equation Q] can have mean estimation 
error at most | . Once projected onto S VD space each projected mean lies within a cube [— y/^, \p^\ k ■ With 



3 / 2 \ / 3/2 1 / 2 \ ^ 

the above chosen grid size, grid search for the means runs in time ( ^jj- ) • ( - — f — ) . Note that grid 

/ j.3/2 \ / , 2 \ Cik 2 

search for the mixing weights runs in time I -^75- 

We now show that not only the mean estimates but also the mixing weights obtained by solving Equation
1 satisfy $|\alpha_i^* - \alpha_i| \leq \epsilon$ for all i. In particular we show that if two mixtures have almost the same means and
the $L^2$ norm of the difference of their densities is small then the difference of the corresponding mixing weights
must also be small.

Corollary 5 With sample size and grid size as in Theorem 4, the solution of Equation 1 provides mixing
weight estimates which are, with high probability, within $\epsilon$ of their true values.

Due to space limitation we defer the proof to the Appendix.

3.1 Lower Bound in 1-Dimensional Setting

In this section we provide the proof of our main theoretical result in the 1-dimensional setting. Before we present
the actual proof, we provide high level arguments that lead us to this result. First note that the Fourier transform
of a mixture of k univariate Gaussians $q(x) = \sum_{i=1}^k \alpha_i K(x, \mu_i)$ is given by

\[ \mathcal{F}(q)(u) = \frac{1}{\sqrt{2\pi}} \int q(x) \exp(-iux)\,dx = \frac{1}{\sqrt{2\pi}} \sum_{j=1}^k \alpha_j \exp\left(-\frac{1}{2}\left(\sigma^2 u^2 + i2u\mu_j\right)\right) \]

Thus, $\|\mathcal{F}(q)\|^2 = \frac{1}{2\pi} \int \left|\sum_{j=1}^k \alpha_j \exp(-iu\mu_j)\right|^2 \exp(-\sigma^2 u^2)\,du$. Since the $L^2$ norm of a function and its
Fourier transform are the same, we can write

\[ \|q\|^2 = \frac{1}{2\pi} \int \left|\sum_{j=1}^k \alpha_j \exp(-iu\mu_j)\right|^2 \exp(-\sigma^2 u^2)\,du. \]

Further, $\int \left|\sum_{j=1}^k \alpha_j \exp(-iu\mu_j)\right|^2 \exp(-\sigma^2 u^2)\,du = \int \left|\sum_{j=1}^k \alpha_j \exp(iu\mu_j)\right|^2 \exp(-\sigma^2 u^2)\,du$ and
we can write

\[ \|q\|^2 = \frac{1}{2\pi} \int |g(u)|^2 \exp(-\sigma^2 u^2)\,du \]

where $g(u) = \sum_{j=1}^k \alpha_j \exp(i\mu_j u)$. This is a complex valued function of a real variable which is infinitely
differentiable everywhere. In order to bound the above square norm from below, our goal is now to find an
interval where $|g(u)|^2$ is bounded away from zero. In order to achieve this, we write the Taylor series expansion
of g(u) at the origin using $(k - 1)$ terms. This can be written in matrix vector multiplication format $g(u) =
\mathbf{u}^T A \alpha + O(u^k)$, where $\mathbf{u}^T = \left[1, u, \frac{u^2}{2!}, \cdots, \frac{u^{k-1}}{(k-1)!}\right]$, such that $A\alpha$ captures the function value and $(k - 1)$
derivative values at the origin. In particular, $\|A\alpha\|^2$ is the sum of the squares of the function g and its $k - 1$
derivatives at the origin. Noting that A is a Vandermonde matrix we establish a lower bound on $\|A\alpha\|$ (see Lemma 12).
This implies that at least one of the $(k - 1)$ derivatives, say the j-th one, of g is bounded
away from zero at the origin. Once this fact is established, and noting that the $(j + 1)$-th derivative of g is bounded
from above everywhere, it is easy to show (see Lemma 10) that it is possible to find an interval $(0, a)$ where
the j-th derivative of g is bounded away from zero in this whole interval. Then using Lemma 11 it can be shown
that it is possible to find a subinterval of $(0, a)$ where the $(j - 1)$-th derivative of g is bounded away from
zero. Thus, successively repeating this Lemma j times, it is easy to show that there exists a subinterval
of $(0, a)$ where $|g|$ is bounded away from zero. Once this subinterval is found, it is easy to show that $\|q\|^2$ is
lower bounded as well.
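The identity $\|q\|^2 = \frac{1}{2\pi}\int |g(u)|^2 \exp(-\sigma^2u^2)\,du$ can be checked numerically. The sketch below is an illustration under stated assumptions rather than part of the proof: the function names, the truncation of the integral to $|u| \leq 12/\sigma$ and the example parameters are hypothetical choices. The left-hand side is evaluated in closed form using $\int K(x,\mu)K(x,\nu)\,dx = (4\pi\sigma^2)^{-1/2}\exp(-(\mu-\nu)^2/(4\sigma^2))$, and the right-hand side by the trapezoidal rule. Signed weights are allowed, as in Theorem 6.

```python
import numpy as np


def l2_norm_sq_closed_form(means, weights, sigma):
    """||q||^2 for q = sum_j weights[j] * K(x, means[j]) with common variance sigma^2."""
    diff = means[:, None] - means[None, :]
    gram = np.exp(-(diff**2) / (4 * sigma**2)) / np.sqrt(4 * np.pi * sigma**2)
    return float(weights @ gram @ weights)


def l2_norm_sq_fourier(means, weights, sigma, num_points=200_001):
    """(1 / 2 pi) * integral of |g(u)|^2 exp(-sigma^2 u^2) du via the trapezoidal rule."""
    u_max = 12.0 / sigma  # the Gaussian factor is negligible beyond this range
    u = np.linspace(-u_max, u_max, num_points)
    # g(u) = sum_j weights[j] * exp(i * means[j] * u), evaluated on the whole grid.
    g = (weights[None, :] * np.exp(1j * u[:, None] * means[None, :])).sum(axis=1)
    integrand = np.abs(g) ** 2 * np.exp(-((sigma * u) ** 2))
    integral = np.sum((integrand[1:] + integrand[:-1]) / 2 * np.diff(u))
    return float(integral / (2 * np.pi))


means = np.array([-0.4, 0.1, 0.5])
weights = np.array([0.3, -0.5, 0.2])  # not necessarily positive
sigma = 0.7

direct = l2_norm_sq_closed_form(means, weights, sigma)
via_fourier = l2_norm_sq_fourier(means, weights, sigma)
```

The two values should agree up to numerical integration error, which is what allows the lower bound to be proved entirely in the Fourier domain.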

Now we present the formal statement of our result. 

Theorem 6 (Lower bound in $\mathbb{R}$) Consider a mixture of k univariate Gaussians $q(x) = \sum_{i=1}^k \alpha_i K(x, \mu_i)$
where, for all i, the mixing coefficients $\alpha_i \in (-1,1)$ and the means $\mu_i \in [-\sqrt{n}, \sqrt{n}]$. Suppose there
exists a $\mu_l$ such that $\min_{j \neq l} |\mu_l - \mu_j| \geq t$, and for all i, $|\alpha_i| \geq \alpha_{\min}$. Then the $L^2$ norm of q satisfies
$\|q\|^2 \geq \alpha_{\min}^{4k} \left(\frac{t}{n}\right)^{Ck^2}$ where C is some positive constant independent of k.
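Proof: Note that

\[ \|q\|^2 = \frac{1}{2\pi} \int |g(u)|^2 \exp(-\sigma^2 u^2)\,du \]

where $g(u) = \sum_{j=1}^k \alpha_j \exp(i\mu_j u)$. Thus, in order to bound the above square norm from below, we need
to find an interval where g(u) is bounded away from zero. Note that g(u) is an infinitely differentiable
function with n-th order derivative⁵ $g^{(n)}(u) = \sum_{j=1}^k \alpha_j (i\mu_j)^n \exp(i\mu_j u)$. Now we can write the Taylor
series expansion of g(u) about the origin as

\[ g(u) = g(0) + g^{(1)}(0)\frac{u}{1!} + g^{(2)}(0)\frac{u^2}{2!} + \cdots + g^{(k-1)}(0)\frac{u^{k-1}}{(k-1)!} + O(u^k) \]

which can be written as

\[
g(u) = \begin{bmatrix} 1 & u & \frac{u^2}{2!} & \cdots & \frac{u^{k-1}}{(k-1)!} \end{bmatrix}
\underbrace{\begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
i\mu_1 & i\mu_2 & i\mu_3 & \cdots & i\mu_k \\
(i\mu_1)^2 & (i\mu_2)^2 & (i\mu_3)^2 & \cdots & (i\mu_k)^2 \\
\vdots & \vdots & \vdots & & \vdots \\
(i\mu_1)^{k-1} & (i\mu_2)^{k-1} & (i\mu_3)^{k-1} & \cdots & (i\mu_k)^{k-1}
\end{bmatrix}}_{A}
\underbrace{\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_k \end{bmatrix}}_{\alpha}
+ O(u^k)
\]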

Note that matrix A is Vandermonde matrix thus, using LemmaElthis implies | 5 (0)| 2 + |.9 (1) (0)| 2 + • • • + 

/ \2(fe-l) , v2(H) 

| 5 Cfc-D(0)|a 
l.9(0)| 2 > - 



> a 2 ■ f * 1 

— mm ^v/n I 



This further implies that either 

, 2(fe-l) 2 / \ 2(fc-l) 

or there exists a j e {1,2, ■■■ , fc-1} such that | ff W(0)| 2 > ^f* ( 

In the worst case we can have j = k — 1, i.e. the (k — l)-th derivative of g is lower bounded at origin and we 
need to find an interval where g itself is lower bounded. 
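The relation between the derivatives of g at the origin and the Vandermonde matrix A can be made concrete with a short numerical sketch. The function names and example values below are hypothetical. The code builds A with rows $(i\mu_j)^m$, $m = 0, \ldots, k-1$, reads off $g^{(m)}(0)$ as the entries of $A\alpha$, and compares the resulting Taylor polynomial with $g(u)$ for a small u.

```python
import math

import numpy as np


def derivative_matrix(means, order):
    """Matrix A whose row m (m = 0..order-1) holds (i * mu_j)^m for each mean mu_j."""
    return np.vander(1j * np.asarray(means, dtype=float), N=order, increasing=True).T


def g(u, means, weights):
    """g(u) = sum_j weights[j] * exp(i * means[j] * u)."""
    return np.sum(weights * np.exp(1j * means * u))


def taylor_polynomial(u, means, weights):
    """Degree-(k-1) Taylor polynomial of g at the origin, written as u^T A alpha."""
    k = len(means)
    derivatives_at_zero = derivative_matrix(means, k) @ weights
    u_vector = np.array([u**m / math.factorial(m) for m in range(k)])
    return u_vector @ derivatives_at_zero


means = np.array([-0.5, 0.2, 0.9])
weights = np.array([0.3, -0.4, 0.5])

derivs = derivative_matrix(means, len(means)) @ weights  # g(0), g'(0), g''(0)
u_small = 0.01
error = abs(g(u_small, means, weights) - taylor_polynomial(u_small, means, weights))
```

Here `derivs[0]` equals the sum of the weights and `derivs[1]` equals $i\sum_j \alpha_j\mu_j$, and the approximation error at `u_small` is of order $u^k$, in line with the $O(u^k)$ remainder in the expansion.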

Next, note that for any u, g^ (u) = J2j=i a j{^j) k exp(iufjij). Thus, \g^\ < J2j=i < 
amax(v / "-) fe - Assuming t < 2 v / n, if we let M = \2^/n) > tnen usm 8 Lemma [TUl if we choose 

f^) fc , and thus, in the interval [0,ol, Ig^^l > # = ^fe ( . 



-^C= in the interval 

2V2 



2 v / 2Q max ( v ^) fc 

2 / \2l 

This implies |i?e[g (fc_1) ]| 2 + |/m[5 (fc_1) ] | 2 > ( 57^ ) ■ For simplicity denote by h = Re[g], thus, 

^(fe-i) _ fl e [gr(*-i)] and without loss of generality assume > (2^) 

(0, a). Now repeatedly applying Lemma [TTIffc — 1) times yields that in the interval 
any other subinterval of length -^hr within [0, a]) 

1^1 > 2^(f^o)(o^) ' (el^r) = (iTf) (t) f^ lEg ) ~ 



(3 fc - 



-r^a, a , (or in 



2 fc 3 — s~ 



^ 3>,-i J a,aj. 



In particular, this implies, \g\ 2 > \h\ 2 > a °°^" 3 fc ° +fc — in an interval 

Next, note that < a < 1 => cxp(— cr 2 ) < cxp(— a 2 a 2 ). Now, denoting /3x = ^t-T^ Q, /3 2 = a, we 
have, 

N| 2 > h^l \9(u)\ 2 exp(~a 2 u 2 )du > ^\g{fi 2 )\ 2 exp^a 2 ) 



CX P( — a ) Q min 
27T 



)( 



= / exp(-<r 2 ) \ o^r^a 2 ^ 
1 27r / 2 2fc 3 fc2 + 2fc - 1 

/ exp(- t 7 2 )a^ i + 3 \ / t 2 k 2+3k 

— I 2-ir J I 2 2fc2 + 5fc + 9 / 2 3 fe2 + 2fc_1 fc' ! + 3 / 2 n 

/ t 2fc 2 +3fc \ 
\^ 2 0(fc 2 logn) J 



f 2fe^+3fc 

22fc 2 +5fc + 9/2 3A; 2 + 2A;-l ^ amax J2A; + l^A; + 3/2 n 2fc 2 +2fc 



> a 



,o(fc 2 ) 



mm V n / 

where the last inequality follows from the fact that if we let

$F(k) = 2^{2k^2+5k+9/2}\; 3^{k^2+2k-1}\; k^{k+3/2}\; n^{2k^2+2k}$, then taking the logarithm with base 2 on both sides yields

$\log(F(k)) = (2k^2 + 5k + 9/2) + (k^2 + 2k - 1)\log 3 + (k + 3/2)\log k + (2k^2 + 2k)\log n = O(k^2 \log n)$.

Thus, $F(k) = 2^{O(k^2 \log n)} = n^{O(k^2)}$. ■



5 Note that the Fourier transform is closely related to the characteristic function, and the $n$-th derivative of $g$ at the origin is
related to the $n$-th order moment of the mixture in the Fourier domain.



3.2 Determinant of Vandermonde Like Matrices 



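In this section we derive a result for the determinant of a Vandermonde-like matrix. This result will be useful
in finding the angle made by any column of a Vandermonde matrix to the space spanned by the rest of the
columns and will be useful in deriving the lower bound in Theorem 6.
Consider any $(n+1) \times n$ matrix $B$ of the form

$$B = \begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
x_1 & x_2 & x_3 & \cdots & x_n \\
x_1^2 & x_2^2 & x_3^2 & \cdots & x_n^2 \\
\vdots & \vdots & \vdots & & \vdots \\
x_1^{n-1} & x_2^{n-1} & x_3^{n-1} & \cdots & x_n^{n-1} \\
x_1^{n} & x_2^{n} & x_3^{n} & \cdots & x_n^{n}
\end{bmatrix}$$
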

If the last row is removed then it exactly becomes an $n \times n$ Vandermonde matrix having determinant $\prod_{i>j}(x_i - x_j)$.
The interesting fact is that if any other row except the last one is removed then the corresponding $n \times n$
matrix has a structure very similar to that of a Vandermonde matrix. The following result shows how the
determinants of such matrices are related to $\prod_{i>j}(x_i - x_j)$.

Lemma 7 For $1 \le i \le (n-1)$, let $B_i$ represent the $n \times n$ matrix obtained by removing the $i$-th row from $B$.
Then $\det(B_i) = c_i \prod_{s>t}(x_s - x_t)$ where $c_i$ is a polynomial having $\binom{n}{i-1}$ terms with each term having degree
$(n - i + 1)$. Terms of the polynomial $c_i$ represent the possible ways in which $(n - i + 1)$ $x_j$'s can be chosen
from $\{x_t\}_{t=1}^{n}$.

Proof: First note that if a matrix has elements that are monomials in some set of variables, then its determinant
will in general be a polynomial in those variables. Next, by the basic property of a determinant, that it is zero if
two of its columns are the same, we can deduce that for $1 \le i \le n$, $\det(B_i) = 0$ if $x_s = x_t$ for some $s \ne t$, $1 \le s, t \le n$,
and hence $q_i(x_1, x_2, \dots, x_n) = \det(B_i)$ contains a factor $p(x_1, x_2, \dots, x_n) = \prod_{s>t}(x_s - x_t)$. Let
$q_i(x_1, x_2, \dots, x_n) = p(x_1, x_2, \dots, x_n)\, r_i(x_1, x_2, \dots, x_n)$.

Now, note that each term of $p(x_1, x_2, \dots, x_n)$ has degree $0 + 1 + 2 + \dots + (n-1) = \frac{n(n-1)}{2}$. Similarly, each
term of the polynomial $q_i(x_1, x_2, \dots, x_n)$ has degree $(0 + 1 + 2 + \dots + n) - (i-1) = \frac{n(n+1)}{2} - (i-1)$. Hence
each term of the polynomial $r_i(x_1, x_2, \dots, x_n)$ must be of degree $\frac{n(n+1)}{2} - (i-1) - \frac{n(n-1)}{2} = (n - i + 1)$.
However, in each term of $r_i(x_1, x_2, \dots, x_n)$, the maximum power of any $x_j$ cannot be greater than 1. This
follows from the fact that the maximum power of $x_j$ in any term of $q_i(x_1, x_2, \dots, x_n)$ is $n$ and in any term of
$p(x_1, x_2, \dots, x_n)$ is $(n-1)$. Hence each term of $r_i(x_1, x_2, \dots, x_n)$ consists of $(n - i + 1)$ different $x_j$'s and
represents the different ways in which $(n - i + 1)$ $x_j$'s can be chosen from $\{x_t\}_{t=1}^{n}$. And since it can be done
in $\binom{n}{n-i+1} = \binom{n}{i-1}$ ways, there will be $\binom{n}{i-1}$ terms in $r_i(x_1, x_2, \dots, x_n)$. ■
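The factorization in Lemma 7 can be checked numerically on small integer inputs. The sketch below is only an illustration under stated assumptions: the helper names (`det`, `vandermonde_like`, `c_poly`, `vandermonde_det`, `check_lemma7`) are hypothetical, the factor $c_i$ is taken to be the sum of all products of $(n - i + 1)$ distinct $x_j$'s with coefficient one, and the determinant is computed by naive Laplace expansion, which is adequate only for small $n$.

```python
from itertools import combinations
from math import prod


def det(m):
    """Integer determinant by Laplace expansion along the first row (small n only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for c, entry in enumerate(m[0]):
        minor = [row[:c] + row[c + 1:] for row in m[1:]]
        total += (-1) ** c * entry * det(minor)
    return total


def vandermonde_like(xs, removed_row):
    """Rows hold the powers 0..n of xs; drop the row with 1-based index removed_row."""
    n = len(xs)
    rows = [[x ** p for x in xs] for p in range(n + 1)]
    return [r for idx, r in enumerate(rows, start=1) if idx != removed_row]


def c_poly(xs, i):
    """Sum over all ways of choosing (n - i + 1) distinct x_j's (assumed form of c_i)."""
    n = len(xs)
    return sum(prod(choice) for choice in combinations(xs, n - i + 1))


def vandermonde_det(xs):
    """Product of (x_s - x_t) over all s > t."""
    return prod(xs[s] - xs[t] for s in range(len(xs)) for t in range(s))


def check_lemma7(xs):
    """True if det(B_i) == c_i * prod_{s>t}(x_s - x_t) for every removable row i."""
    # i = n + 1 (last row removed) is included: it gives the plain Vandermonde matrix.
    return all(
        det(vandermonde_like(xs, i)) == c_poly(xs, i) * vandermonde_det(xs)
        for i in range(1, len(xs) + 2)
    )


print(check_lemma7([2, -1, 3, 5]))
```

The number of terms of $c_i$ is the number of subsets enumerated by `combinations`, which matches the count $\binom{n}{i-1}$ stated in the lemma.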
3.3 Estimation of Unknown Variance 

In this section we discuss a procedure for consistent estimation of the unknown variance due to [16] (for the
one-dimensional case) and prove that the estimate can be obtained with polynomial sample size and running time.
This estimated variance can then be used in place of the true variance in our main algorithm discussed earlier,
and the remaining mixture parameters can be estimated subsequently.

We start by noting that a mixture of $k$ identical spherical Gaussians $\sum_{i=1}^{k} \alpha_i N(\mu_i, \sigma^2 I)$ in $\mathbb{R}^n$ projected on
an arbitrary line becomes a mixture of identical 1-dimensional Gaussians $p(x) = \sum_{i=1}^{k} \alpha_i N(\mu_i, \sigma^2)$. While
the means of components may no longer be different, the variance does not change. Thus, the problem is
easily reduced to the 1-dimensional case.

We will now show that the variance of a mixture of $k$ Gaussians in 1 dimension can be estimated from
a sample of size $poly(\frac{1}{\epsilon}, \frac{1}{\delta})$, where $\epsilon > 0$ is the precision, with probability $1 - \delta$ in time $poly(\frac{1}{\epsilon}, \frac{1}{\delta})$. This
will lead to an estimate for the $n$-dimensional mixture using $poly(n, \frac{1}{\epsilon}, \frac{1}{\delta})$ sample points/operations.

Consider now the set of Hermite polynomials $\gamma_i(x, \tau)$ given by the recurrence relation $\gamma_i(x, \tau) =
x\gamma_{i-1}(x, \tau) - (i-1)\tau^2 \gamma_{i-2}(x, \tau)$, where $\gamma_0(x, \tau) = 1$ and $\gamma_1(x, \tau) = x$. Take $M$ to be the $(k+1) \times (k+1)$
matrix defined by

$M_{ij} = E_p[\gamma_{i+j}(X, \tau)], \quad 0 \le i + j \le 2k.$

It is shown in Lemma 5A of [16] that the determinant $\det(M)$ is a polynomial in $\tau$ and, moreover, that the
smallest positive root of $\det(M)$, viewed as a function of $\tau$, is equal to the variance $\sigma$ of the original mixture
$p$. We will use $d(\tau)$ to represent $\det(M)$.

This result leads to an estimation procedure, after observing that $E_p[\gamma_{i+j}(X, \tau)]$ can be replaced by its
empirical value given a sample $X_1, X_2, \dots, X_N$ from the mixture distribution $p$. Indeed, one can construct
the empirical version of the matrix $M$ by putting

$$\hat{M}_{ij} = \frac{1}{N}\sum_{t=1}^{N} \gamma_{i+j}(X_t, \tau), \quad 0 \le i + j \le 2k. \qquad (6)$$

It is clear that $\hat{d}(\tau) = \det(\hat{M})(\tau)$ is a polynomial in $\tau$. Thus we can provide an estimate $\sigma^*$ for the variance
$\sigma$ by taking the smallest positive root of $\hat{d}(\tau)$. This leads to the following estimation procedure:

Parameter: Number of components $k$.

Input: $N$ points in $\mathbb{R}^n$ sampled from $\sum_{i=1}^{k} \alpha_i N(\mu_i, \sigma^2 I)$.
Output: $\sigma^*$, estimate of the unknown variance.

Step 1. Select an arbitrary direction $v \in \mathbb{R}^n$ and project the data points onto this direction.
Step 2. Construct the $(k+1) \times (k+1)$ matrix $\hat{M}(\tau)$ using Eq. (6).

Step 3. Compute the polynomial $\hat{d}(\tau) = \det(\hat{M})(\tau)$. Obtain the estimated variance $\sigma^*$ by approximating
the smallest positive root of $\hat{d}(\tau)$. This can be done efficiently by using any standard numerical method or
even a grid search.
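The steps of the procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, exact population moments of a one-dimensional mixture stand in for the sample averages of Eq. (6) so that the run is deterministic, and the root is located by a plain grid search. With the recurrence written with $\tau^2$ as above, the determinant vanishes at $\tau = \sigma$ for components $N(\mu_i, \sigma^2)$, which is what the example recovers.

```python
from fractions import Fraction


def hermite_coeffs(max_degree, tau):
    """Coefficient lists (lowest power of x first) of gamma_0..gamma_max_degree at fixed tau."""
    polys = [[Fraction(1)], [Fraction(0), Fraction(1)]]
    for i in range(2, max_degree + 1):
        shifted = [Fraction(0)] + polys[i - 1]        # x * gamma_{i-1}
        older = polys[i - 2] + [Fraction(0)] * 2      # gamma_{i-2}, padded to equal length
        polys.append([a - (i - 1) * tau ** 2 * b for a, b in zip(shifted, older)])
    return polys[: max_degree + 1]


def moment_matrix(raw_moments, k, tau):
    """(k+1) x (k+1) matrix with entries E[gamma_{i+j}(X, tau)], from E[X^0..X^{2k}]."""
    polys = hermite_coeffs(2 * k, tau)
    expect = [sum(c * raw_moments[d] for d, c in enumerate(p)) for p in polys]
    return [[expect[i + j] for j in range(k + 1)] for i in range(k + 1)]


def det(matrix):
    """Determinant by Gaussian elimination; exact when the entries are Fractions."""
    a = [row[:] for row in matrix]
    result = Fraction(1)
    for col in range(len(a)):
        pivot = next((r for r in range(col, len(a)) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, len(a)):
            factor = a[r][col] / a[col][col]
            a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return result


def smallest_root_on_grid(raw_moments, k, grid):
    """Grid search: first tau at which the determinant is no longer positive."""
    for tau in grid:
        if det(moment_matrix(raw_moments, k, tau)) <= 0:
            return tau
    return None


def mixture_raw_moments(weights, means, variance, max_degree):
    """Exact E[X^0..X^max_degree] of a 1-D Gaussian mixture (stand-in for sample averages)."""
    totals = [Fraction(0)] * (max_degree + 1)
    for w, mu in zip(weights, means):
        m = [Fraction(1), Fraction(mu)]
        for d in range(2, max_degree + 1):
            m.append(mu * m[d - 1] + (d - 1) * variance * m[d - 2])
        for d in range(max_degree + 1):
            totals[d] += w * m[d]
    return totals


k = 2
sigma = Fraction(2)  # components are N(mu_i, sigma^2)
moments = mixture_raw_moments([Fraction(1, 3), Fraction(2, 3)], [-1, 2], sigma ** 2, 2 * k)
grid = [Fraction(j, 10) for j in range(1, 31)]
print(smallest_root_on_grid(moments, k, grid))
```

With real data one would pass the empirical raw moments $\frac{1}{N}\sum_t X_t^d$ of the projected sample instead of `mixture_raw_moments`. Sampling noise then keeps the determinant from vanishing exactly, so the search stops at the first sign change rather than at an exact zero.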

We will now state our main result in this section, which establishes that this algorithm for variance estimation
is indeed polynomial in both the ambient dimension $n$ and the inverse of the desired accuracy $\epsilon$.

Theorem 8 For any $\epsilon > 0$, $0 < \delta < 1$, if sample size $N > O\left(\frac{n^{poly(k)}}{\epsilon^2 \delta}\right)$, then the above procedure provides
an estimate $\sigma^*$ of the unknown variance $\sigma$ such that $|\sigma - \sigma^*| \le \epsilon$ with probability greater than $1 - \delta$.

The idea of the proof is to show that the coefficients of the polynomials $d(\tau)$ and $\hat{d}(\tau)$ are polynomially
close, given enough samples from $p$. That (under some additional technical conditions) can be shown to
imply that the smallest positive roots of these polynomials are also close. To verify that $d(\tau)$ and $\hat{d}(\tau)$ are
close, we use the fact that the coefficients of $d(\tau)$ are polynomial functions of the first $2k$ moments of $p$, while
the coefficients of $\hat{d}(\tau)$ are the same functions of the empirical moment estimates. Using standard concentration
inequalities for the first $2k$ moments and providing a bound for these functions yields the result.

The details of the proof are provided in Appendix C.
Appendix



A Proof of Some Auxiliary Lemmas 

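Lemma 9 For any $v_1, v_2 \in \mathbb{R}^n$ and any $\alpha_1, \alpha_2 \in \mathbb{R}$, $\|\alpha_1 v_1 + \alpha_2 v_2\| \ge |\alpha_1|\,\|v_1\|\,|\sin(\beta)|$ where $\beta$ is the
angle between $v_1$ and $v_2$.

Proof: Let $s \in \mathbb{R}$ such that $\langle \alpha_1 v_1 + s v_2, v_2 \rangle = 0$. This implies $s = -\frac{\alpha_1 \langle v_1, v_2 \rangle}{\|v_2\|^2}$. Now,

$$\|\alpha_1 v_1 + \alpha_2 v_2\|^2 = \|(\alpha_1 v_1 + s v_2) + (\alpha_2 - s) v_2\|^2 = \|\alpha_1 v_1 + s v_2\|^2 + \|(\alpha_2 - s) v_2\|^2 \ge \|\alpha_1 v_1 + s v_2\|^2$$

$$= \langle \alpha_1 v_1 + s v_2, \alpha_1 v_1 + s v_2 \rangle = \langle \alpha_1 v_1 + s v_2, \alpha_1 v_1 \rangle = \alpha_1^2 \|v_1\|^2 + \alpha_1 s \langle v_1, v_2 \rangle$$

$$= \alpha_1^2 \|v_1\|^2 - \alpha_1^2 \frac{\langle v_1, v_2 \rangle^2}{\|v_2\|^2} = \alpha_1^2 \|v_1\|^2 - \alpha_1^2 \frac{(\|v_1\|\,\|v_2\| \cos(\beta))^2}{\|v_2\|^2} = \alpha_1^2 \|v_1\|^2 (1 - \cos^2(\beta)) = \alpha_1^2 \|v_1\|^2 \sin^2(\beta)$$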



Lemma 10 Let $h : \mathbb{R} \to \mathbb{C}$ be an infinitely differentiable function such that for some positive integer $n$ and
real $M, T > 0$, $|h^{(n)}(0)| \ge M$ and $|h^{(n+1)}| \le T$. Then for any $0 < a < \frac{M}{\sqrt{2}\,T}$, $|h^{(n)}| \ge M - \sqrt{2}\,T a$ in the
interval $[0, a]$.

Proof: Using the mean value theorem for complex-valued functions, for any $x \in [0, a]$, $|h^{(n)}(x) - h^{(n)}(0)| \le
\sqrt{2}\,T a$, which implies $M - |h^{(n)}(x)| \le \sqrt{2}\,T a$. ■



Lemma 11 Let $h : \mathbb{R} \to \mathbb{R}$ be an infinitely differentiable function such that for some positive integer $n$ and
real $M > 0$, $|h^{(n)}| \ge M$ in an interval $(a, b)$. Then $|h^{(n-1)}| \ge M(b - a)/6$ in a smaller interval, either in
$(a, \frac{2a+b}{3})$ or in $(\frac{a+2b}{3}, b)$.

Proof: Consider two intervals $I_1 = (a, \frac{2a+b}{3})$ and $I_2 = (\frac{a+2b}{3}, b)$. Choose any two arbitrary points $x \in I_1$,
$y \in I_2$. Then by the mean value theorem, for some $c \in (a, b)$, $|h^{(n-1)}(x) - h^{(n-1)}(y)| = |h^{(n)}(c)|\,|x - y| \ge
M(b - a)/3$.

If the statement of the Lemma is false then we can find $x^* \in I_1$ and $y^* \in I_2$ such that $|h^{(n-1)}(x^*)| <
M(b - a)/6$ and $|h^{(n-1)}(y^*)| < M(b - a)/6$. This implies $|h^{(n-1)}(x^*) - h^{(n-1)}(y^*)| < M(b - a)/3$.
Contradiction. ■

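Generalized cross product:

The cross product between two vectors $v_1, v_2$ in $\mathbb{R}^3$ is a vector orthogonal to the space spanned by $v_1, v_2$.
This idea can be generalized to any finite dimension in terms of determinant and inner product as follows.
The cross product of $(n-1)$ vectors $v_1, \dots, v_{n-1} \in \mathbb{R}^n$ is the unique vector $u \in \mathbb{R}^n$ such that for all
$z \in \mathbb{R}^n$, $\langle z, u \rangle = \det[v_1, \dots, v_{n-1}, z]$. With this background we provide the next result, for which we
introduce the following $k \times k$ Vandermonde matrix $A$.

$$A = \begin{bmatrix}
1 & 1 & 1 & \cdots & 1 \\
x_1 & x_2 & x_3 & \cdots & x_k \\
x_1^2 & x_2^2 & x_3^2 & \cdots & x_k^2 \\
\vdots & \vdots & \vdots & & \vdots \\
x_1^{k-1} & x_2^{k-1} & x_3^{k-1} & \cdots & x_k^{k-1}
\end{bmatrix}$$
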



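Lemma 12 For any integer $k > 1$ and positive $a, t \in \mathbb{R}$, let $x_1, x_2, \dots, x_k \in [-a, a]$ and suppose there exists an $x_i$
such that $t = \min_{j, j \ne i} |x_i - x_j|$. Let $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_k) \in \mathbb{R}^k$ with $\min_i |\alpha_i| \ge \alpha_{\min}$. Then for $A$ as
defined above, $\|A\alpha\| \ge \alpha_{\min} \left(\frac{t}{1+a}\right)^{k-1}$.
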

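Proof: We will represent the $i$-th column of $A$ by $v_i \in \mathbb{R}^k$. Without loss of generality, let the nearest point
to $x_k$ be at a distance $t$. Then $\|A\alpha\| = \|\alpha_k v_k + \sum_{i=1}^{k-1} \alpha_i v_i\|$. Note that $\sum_{i=1}^{k-1} \alpha_i v_i$ lies in the space
spanned by the vectors $\{v_i\}_{i=1}^{k-1}$, i.e., in $span\{v_1, v_2, \dots, v_{k-1}\}$. Let $u \in \mathbb{R}^k$ be the vector orthogonal to
$span\{v_1, v_2, \dots, v_{k-1}\}$ that represents the cross product of $v_1, v_2, \dots, v_{k-1}$. Let $\beta$ be the angle between $u$
and $v_k$. Then using Lemma 9, $\|A\alpha\| \ge |\alpha_k|\,\|v_k\|\,|\sin(90^\circ - \beta)| \ge \alpha_{\min} \|v_k\|\,|\cos(\beta)|$. Using the concept
of generalized cross product,

$$\langle u, v_k \rangle = \det(A) \;\Rightarrow\; \|v_k\|\,|\cos(\beta)| = \frac{|\det(A)|}{\|u\|} \qquad (7)$$

Let $\tilde{A} = [v_1, v_2, \dots, v_{k-1}] \in \mathbb{R}^{k \times (k-1)}$. Note that $\|u\|^2 = \sum_{i=1}^{k} (\det(\tilde{A}_i))^2$ where $\tilde{A}_i$ represents the
$(k-1) \times (k-1)$ matrix obtained by removing the $i$-th row from $\tilde{A}$. Since each $|x_i| \le a$, and for any integer
$0 \le b \le k-1$, $\binom{k-1}{b} = \binom{k-1}{k-1-b}$, using Lemma 7,

$$\|u\|^2 \le \prod_{k-1 \ge s > t \ge 1} |x_s - x_t|^2 \left(1 + \left(\tbinom{k-1}{1} a\right)^2 + \left(\tbinom{k-1}{2} a^2\right)^2 + \dots + \left(\tbinom{k-1}{k-1} a^{k-1}\right)^2\right)$$

$$\le \prod_{k-1 \ge s > t \ge 1} |x_s - x_t|^2 \left(1 + \tbinom{k-1}{1} a + \tbinom{k-1}{2} a^2 + \dots + \tbinom{k-1}{k-1} a^{k-1}\right)^2$$

$$= \left(\prod_{k-1 \ge s > t \ge 1} |x_s - x_t|^2\right) (1 + a)^{2(k-1)},$$

where the first inequality bounds each term supplied by Lemma 7 using $|x_i| \le a$, the second inequality follows from
the fact that for any $b_1, b_2, \dots, b_n \ge 0$, $\sum_{i=1}^{n} b_i^2 \le \left(\sum_{i=1}^{n} b_i\right)^2$, and the final equality follows from the fact that
for any $c > 0$ and positive integer $n$, $(c+1)^n = \sum_{i=0}^{n} \binom{n}{i} c^i$.
Note that $\det(A) = \prod_{k \ge s > t \ge 1} (x_s - x_t) = \prod_{k-1 \ge s > t \ge 1} (x_s - x_t) \prod_{k-1 \ge r \ge 1} (x_k - x_r)$. Plugging these values in Equation
(7) yields $\|v_k\|\,|\cos(\beta)| \ge \frac{\prod_{k-1 \ge r \ge 1} |x_k - x_r|}{(1+a)^{k-1}} \ge \frac{t^{k-1}}{(1+a)^{k-1}}$. This implies $\|A\alpha\| \ge \alpha_{\min} \left(\frac{t}{1+a}\right)^{k-1}$. ■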
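The bound of Lemma 12 is easy to probe numerically. The following sketch is an illustration only, with hypothetical helper names; it takes $a = \max_i |x_i|$ and, as an assumption consistent with the proof (which may single out any column), takes $t$ to be the largest nearest-neighbour distance among the points.

```python
from math import sqrt


def vandermonde_times(xs, alphas):
    """Return A @ alpha, where column i of A is (1, x_i, ..., x_i^(k-1))."""
    k = len(xs)
    return [sum(al * x ** p for x, al in zip(xs, alphas)) for p in range(k)]


def lemma12_sides(xs, alphas):
    """Return (||A alpha||, alpha_min * (t / (1 + a))^(k - 1)) for distinct points xs."""
    k = len(xs)
    a = max(abs(x) for x in xs)
    # Nearest-neighbour distance of each point; use the largest one as t.
    t = max(
        min(abs(x - y) for j, y in enumerate(xs) if j != i)
        for i, x in enumerate(xs)
    )
    alpha_min = min(abs(al) for al in alphas)
    norm = sqrt(sum(v * v for v in vandermonde_times(xs, alphas)))
    return norm, alpha_min * (t / (1 + a)) ** (k - 1)


# Signed weights summing to zero, as in a difference of two mixtures.
norm, bound = lemma12_sides([-1.0, -0.5, 0.25, 1.0], [0.5, -0.5, 0.3, -0.3])
print(norm >= bound)
```

The example uses weights that cancel in the first coordinate of $A\alpha$, which is the situation of interest when comparing two mixtures: the lower bound then comes entirely from the higher-order rows.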

Lemma 13 Consider any set of $k$ points $\{x_i\}_{i=1}^{k}$ in $\mathbb{R}^n$. There exists a direction $v \in \mathbb{R}^n$, $\|v\| = 1$, such that for
any $i, j$, $|\langle x_i, v \rangle - \langle x_j, v \rangle| > \frac{\|x_i - x_j\|}{k^2}$.

Proof: For $k$ points $\{x_i\}_{i=1}^{k}$, there exist $\binom{k}{2}$ directions $\frac{x_i - x_j}{\|x_i - x_j\|}$, $i, j = 1, 2, \dots, k$, obtained by joining all
possible pairs of points. Let us renumber these directions as $u_i$, $i = 1, 2, \dots, \binom{k}{2}$. Now, consider any arbitrary
direction $u_j$ formed using points $x_m$ and $x_n$ respectively. If the $x_i$'s are projected to any direction orthogonal
to $u_j$, then at least two $x_i$'s, $x_m$ and $x_n$, coincide. In order to show that there exists some direction such that,
upon projecting the $x_i$'s on it, no two $x_i$'s become too close, we adopt the following strategy. Consider an
$n$-dimensional unit ball $S$ centered at the origin and place all $u_j$, $j = 1, 2, \dots, \binom{k}{2}$, direction vectors on the ball
starting at the origin. Thus each $u_j$ is represented by an $n$-dimensional point on the surface of $S$. For any $u_j$,
consider all vectors $v \in \mathbb{R}^n$ lying on a manifold, the $(n-1)$-dimensional unit ball having center at the origin
and orthogonal to $u_j$. These directions are the "bad" directions because if the $x_i$'s are projected to any of these
directions then at least two $x_i$'s coincide. We want to perturb these "bad" directions a little bit and form an
object and show that we can control the size of an angle such that the volume of the union of these $\binom{k}{2}$ objects is
much less than the volume of $S$, which implies that there are some "good" directions for projection.

Consider any < (3 < For any ix ; , let C, = {x £ S : arcsin ^r~rp < is the perturbed version 

of a bad direction and we do not want to project a^s on any direction contained in d. The volume of Cj is 
shown in the shaded area in FigureQ] A simple upper bound of this volume can be estimated by the volume 

of a larger n dimensional cylinder C of radius 1 and height 2 sin(/3). Let C = U^f/jC, . Thus, total volume of 

C is vol{C) = u21«oJ(Cj) < U^voliC'i) < k(k - 1) sin(/3) x ( izL^— ] . Note that vol(S) = Y^TTj- 



fc(fc-l)sinC8) ^ W + l) 



We want vol(C) < vol(S). This implies 

fc - 1) sin(ft) ] 

"r(f + 1) 

Now we consider two cases. 
case 1: n is even 

From the definition of Gamma function denominator of r.h.s of Equation[8]is f ||) ! Since n — 1 is odd, using 
the definition of Gamma function, the numerator of Equation [8] becomes V"^"^ 1 )' 1 = 2 ^}™~ 2 1 ?.' using the 

fact that (2n + 1)!! = (2 2 "+V ! . Thus r.h.s of Equation[8]becomes 20F (a^TTjlfp) < x \ = V*, 
where the last inequality can be easily shown as follows, 

/ (n-l)\ \ _ 1 (n-l)(n-2)(n-3)(n-4)-l _ 1 ( (n-l)(ra-3)(ra-5)-l \ 1 

I 2™(2-l)!(f )! J 2 [(n-2)(n-4)(n-6) — 2][(n(n-2)(n-4) — 2] 2^ n(n-2)(n-4)— 2 J^2' 



case 2: n is odd 

n — 1 is even and thus numerator of r.h.s of Equation [8] becomes f 1 ^)' The denominator become , 

2~ S - 

which in turn is equal to using the relation between double factorial and factorial. Thus, r.h.s of 

2 n {— — ). 

Equation[8]becomes - 2 J,' 1 ' 2 ^ < x2 = The last inequahty follows from the fact that, 

( 2"(^)!(^)! ^ „ 2[(n-l)(n-3)(n-5)-2][(n-l)(n-3)(Ti-5)-2] _ 9 ( (n-l)(re-3)(n-5)-2 



n(n-l)(n-2)(n-3)(n-4)---l I n(n-2)(n-4)---3.1 



— ^ „^„ — ^vtz — r; — n — v. ^ 



Figure 1: For $u_i$, the shaded region represents the volume $C_i$, which corresponds to all the "bad" directions
associated with $u_i$. An upper bound for $C_i$ is $C'_i$, which is a cylinder with radius 1 and height $2\sin\beta$ and is
shown by the rectangular box.



Thus, for any n, to ensure existence of a good direction, we must have 



fc(fc-l) sin(/3) 



< ^/tt which implies 



Once f3 is chosen this way, volume of C is less than volume of S and hence 
1, such that if a^s are projected along this "good" direction v, 



sin(/3) < kprryy- Fixing /3 small enough, in particular setting f3 — f3* such that sin(/3*) = p- satisfies strict 
inequality sin(/3*) < oprn 
there exists some "good" direction v, \\v\\ 
no two (x.i, v) becomes too close. Now consider any v on the surface of S which is not contained in any 

of the C { s and hence in any of the C^s. This implies for any i, j, \ (v, ||^'_^ J I ) \ > sin(/3* ) = -p-, and hence 

\(v,(Xi-Xj))\ > ■ 

Note that the above Lemma can also be considered as a special kind of one-sided version of the Johnson-
Lindenstrauss Lemma, specifically, when equivalently expressed as follows: for given small enough $\beta > 0$ (hence
$\sin(\beta) \approx \beta$) and vector $y = x_i - x_j$, with probability at least $1 - O(\beta)$, a random unit vector $v$ has
the property that the projection of $y$ on to the span of $v$ has length at least $\beta \|y\|$. However, our result is
deterministic.
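For points in the plane, a direction of the kind promised by Lemma 13 can be found by brute force. The sketch below is not the volume argument of the proof: it simply scans unit vectors on a grid of angles and keeps the one whose worst pair is best separated. The function names are hypothetical, the restriction to $\mathbb{R}^2$ is a simplification, and the comparison threshold $1/k^2$ is taken as an assumption about the constant in the lemma.

```python
from itertools import combinations
from math import cos, sin, pi, hypot


def worst_pair_ratio(points, v):
    """min over pairs of |<x_i - x_j, v>| / ||x_i - x_j|| for a unit vector v."""
    ratios = []
    for (px, py), (qx, qy) in combinations(points, 2):
        dx, dy = px - qx, py - qy
        ratios.append(abs(dx * v[0] + dy * v[1]) / hypot(dx, dy))
    return min(ratios)


def best_direction(points, steps=3600):
    """Scan unit vectors with angle in [0, pi) and maximise the worst-pair ratio."""
    candidates = ((cos(pi * s / steps), sin(pi * s / steps)) for s in range(steps))
    return max(candidates, key=lambda v: worst_pair_ratio(points, v))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # k = 3 distinct points
v = best_direction(points)
k = len(points)
print(worst_pair_ratio(points, v) > 1.0 / k ** 2)
```

Only angles in $[0, \pi)$ are scanned because $v$ and $-v$ give the same absolute projections. In higher dimensions a grid scan is impractical, which is one reason an existence argument such as the one in the proof is useful.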



Lemma 14 Let $g : \mathbb{R}^k \to \mathbb{R}$ be a continuous bounded function. Let $v, u_1, \dots, u_{k-1} \in \mathbb{R}^k$ be an orthonormal
basis of $\mathbb{R}^k$ and let $g_1 : \mathbb{R} \to \mathbb{R}$ be defined as $g_1(v) = \int \cdots \int g(v, u_1, \dots, u_{k-1})\, du_1 \cdots du_{k-1}$. Then for some
$c > 0$, $\|g\|^2 \ge \left(\frac{1}{c\sigma}\right)^{k} \|g_1\|^2$.

Proof: Note that $\|g\|^2 = \int \left( \int \cdots \int |g(v, u_1, \dots, u_{k-1})|^2\, du_1 \cdots du_{k-1} \right) dv$ and
$\|g_1\|^2 = \int |g_1(v)|^2\, dv = \int \left( \int \cdots \int g(v, u_1, \dots, u_{k-1})\, du_1 \cdots du_{k-1} \right)^2 dv$. For any sufficiently large $L > 0$
we concentrate on a bounded domain $A = [-L, L]^k \subset \mathbb{R}^k$ outside which the function value becomes arbitrarily
small and so do the norms. Note that this is a very realistic assumption because component Gaussians
have exponential tail decay; thus selecting $L$ to be, for example, some constant multiple of $\sigma$ will make
sure that outside $A$ the norms are negligible. We will show the result for a function of two variables, and the same
result holds for more than two variables, where for each additional variable we get an additional multiplicative
factor of $2L$. Also for simplicity we will assume the box to be $[0, 2L]^2$ as opposed to $[-L, L]^2$. Note
that this does not change the analysis.

We have $\|g\|^2 = \int_0^{2L} \int_0^{2L} g^2(v, u_1)\, dv\, du_1$ and $\|g_1\|^2 = \int_0^{2L} \left( \int_0^{2L} g(v, u_1)\, du_1 \right)^2 dv$. By the change of
variable $x = \frac{u_1}{2L}$, we have $\int_0^{2L} g(v, u_1)\, du_1 = 2L \int_0^1 g(v, 2Lx)\, dx$. Here $dx$ acts as a probability measure
and hence applying Jensen's inequality we get

$$\left[ \int_0^{2L} g(v, u_1)\, du_1 \right]^2 = (2L)^2 \left[ \int_0^1 g(v, 2Lx)\, dx \right]^2 \le (2L)^2 \int_0^1 g^2(v, 2Lx)\, dx = 2L \int_0^{2L} g^2(v, u_1)\, du_1$$

where the inequality follows from Jensen's and the last equality follows by changing variable one more time.
Thus,

$$\|g_1\|^2 = \int_0^{2L} \left( \int_0^{2L} g(v, u_1)\, du_1 \right)^2 dv \le \int_0^{2L} \left( 2L \int_0^{2L} g^2(v, u_1)\, du_1 \right) dv = 2L \|g\|^2.$$

For each additional variable we get an additional multiplicative $2L$ term, hence,

$$\|g_1\|^2 \le (2L)^{k-1} \|g\|^2 \le (2L)^{k} \|g\|^2.$$




A version of the following Lemma was proved in [23]. We tailor it for our purpose.

Lemma 15 Let the rows of $A \in \mathbb{R}^{N \times n}$ be picked according to a mixture of Gaussians with means $\mu_1, \mu_2, \dots,
\mu_k \in \mathbb{R}^n$, common variance $\sigma^2$ and mixing weights $\alpha_1, \alpha_2, \dots, \alpha_k$ with minimum mixing weight being $\alpha_{\min}$.
Let $\tilde{\mu}_1, \tilde{\mu}_2, \dots, \tilde{\mu}_k$ be the projections of these means on to the subspace spanned by the top $k$ right singular
vectors of the sample matrix $A$. Then for any $0 < \epsilon < 1$, $0 < \delta < 1$, with probability at least $1 - \delta$,

||m 4 - Aill < provided N = (log (^-) + ^ log(±))), 

Proof: First note that from Theorem 2 and Corollary 3 of 11231 , for < e < i with probability at least 1 — 5, 
we have, Ei=i w;(||Mi|| 2 — ||mJ 2 ) < ~ k)a 2 provided, 

N = n (V- fnlpg ( ? + max W!) + —Jo g (b)) (9) 
\e amin V V e 1 to 2 J n-k 5 J J 

Now setting e = e ^fc" , we have ||m, — Mill 2 = HmJ 2 — IIAJ 2 < e 2 c 2 - Next, setting e = 2ea yields 

2 

the desired result. Note that for this choice of e, e = 4 ^ 2 "°'i" fc ) ■ Further, restricting < e < 1, yields 

e < i^^Zfc] < | as required. Now noticing that ||mJ < y/n and plugging in e = 4 ^ 2 "™ fc ^ in Equation [9] 
yields the desired sample size. ■ 

In the following Lemma we consider a mixture of Gaussians where the mixing weights are allowed to
take negative values. This might sound counterintuitive since mixtures of Gaussians are never allowed to take
negative mixing weights. However, if we have two separate mixtures, for example, one true mixture density
$p(x)$ and its estimate $\hat{p}(x)$, the function $(p - \hat{p})(x)$ that describes the difference between the two densities
can be thought of as a mixture of Gaussians with negative coefficients. Our goal is to find a bound on the $L^2$
norm of such a function.

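Lemma 16 Consider a mixture of $m$ $k$-dimensional Gaussians $f(x) = \sum_{i=1}^{m} \alpha_i K(x, \nu_i)$ where the mixing
coefficients $\alpha_i \in (-1, 1)$, $i = 1, 2, \dots, m$. Then the $L^2$ norm of $f$ satisfies $\|f\|^2 \le \left(\frac{1}{(2\pi\sigma^2)^k}\right) \alpha^T K \alpha$,
where $K$ is an $m \times m$ matrix with $K_{ij} = \exp\left(-\frac{\|\nu_i - \nu_j\|^2}{2\sigma^2}\right)$ and $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_m)^T$.

Proof: Let the kernel $t(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$ define a unique RKHS $\mathcal{H}$. Then

$$\|f\|_{\mathcal{H}}^2 = \left\langle \sum_{i=1}^{m} \alpha_i \frac{1}{(\sqrt{2\pi}\sigma)^k} \exp\left(-\frac{\|x - \nu_i\|^2}{2\sigma^2}\right), \sum_{j=1}^{m} \alpha_j \frac{1}{(\sqrt{2\pi}\sigma)^k} \exp\left(-\frac{\|x - \nu_j\|^2}{2\sigma^2}\right) \right\rangle_{\mathcal{H}}$$

$$= \left(\frac{1}{(2\pi\sigma^2)^k}\right) \left\langle \sum_{i=1}^{m} \alpha_i t(\nu_i, \cdot), \sum_{j=1}^{m} \alpha_j t(\nu_j, \cdot) \right\rangle_{\mathcal{H}}$$

$$= \left(\frac{1}{(2\pi\sigma^2)^k}\right) \left\{ \sum_{i=1}^{m} \alpha_i^2 \langle t(\nu_i, \cdot), t(\nu_i, \cdot) \rangle_{\mathcal{H}} + \sum_{i,j,\, i \ne j} \alpha_i \alpha_j \langle t(\nu_i, \cdot), t(\nu_j, \cdot) \rangle_{\mathcal{H}} \right\}$$

$$= \left(\frac{1}{(2\pi\sigma^2)^k}\right) \left\{ \sum_{i=1}^{m} \alpha_i^2 + \sum_{i,j,\, i \ne j} \alpha_i \alpha_j\, t(\nu_i, \nu_j) \right\} = \left(\frac{1}{(2\pi\sigma^2)^k}\right) \alpha^T K \alpha.$$

Since the $L^2$ norm is bounded by the RKHS norm the result follows. ■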

Proof of Corollary |5] Proof : Consider a new mixture p (x, m, a*) obtained by perturbing the means of 
Po(x, m*, a*). For ease of notation we use the following short hands p a = p a (x, m, ot),p = p (x, m, a*) 
and p* Q = p (x,m*,a*). Note that the function f\ mentioned in Theorem|4] which provides the lower bound 
of a mixture norm, is also a function of k and a m j n . We will explicitly use this fact here. Now, 

I bo - Poll < bo -p || + 1 bo - Poll 

a 

< 2|bo - Poll < 2 (WPkde "Poll + \\Pkde - Poll) 
J 2 (/ 2 (G) +€,+£,) 

<2/i(f) =2/i(2fc,a m5n ,f) 
where in equality a follows from the fact j|p* — p a \\ < \\p a — p*\\ dictated by the upper bound of Lemma[3] 
inequality b follows from Equation [2] and [5] and finally inequality c follows from Equation|4] 

It is easy to see that /i(fc, /3 max , d m in/2) < \\p - po\\ where /3 max = max i {|a l - 6n\}. In order to see 
this, note that p Q — p a is a mixture of k Gaussians with mixing weights (a* — on) and minimum distance 
between any pair of means is at least ^f^. This is because after projection onto SVD space each mean can 
move by a distance of at most |. Thus, minimum pairwise distance between any pairs of projected means is 



at least d m i n — e > ^fiL since e < ^f^. Now, choose the Gaussian component that has absolute value of the 
mixing coefficient /3 max and apply the same argument as in Theorem [2] (Note that in Lemma fT2l we do not 
need to replace /3 max by /3 min ). 

Combining lower and upper bounds we get fi(k, /3 m ax, ^g" 1 ) < \\p a — p a \\ < 2fi(2k, a m - m , |). Simpli- 
fying the inequality fi(k, /3 max , ^p) < 2/i(2fc, a min , §) and solving for /3 max yields 

/?max = maxi |ai — a.%\ < -ffit ( 256n 3 fc ti ) ^ or some positive C2 independent of n and k. Clearly, 
/3max < e. ■ 



B Finite Sample Bound for Kernel Density Estimates in High Dimension 

Most of the available literature on kernel density estimation in high dimension provides asymptotic mean
integrated square error approximations, see for example [24], while it is not very difficult to find an upper bound
for the mean integrated square error (MISE), as we will show in this section. Our goal is to show that for
a random sample of sufficiently large size, the integrated square error based on this sample is close to its
expectation (MISE) with high probability.

We will start with a few standard tools that we will require to derive our result. 
Multivariate version of Taylor series:

Consider the standard Taylor series expansion with remainder term of a twice differentiable function $f : \mathbb{R} \to \mathbb{R}$,

$$f(t) = f(0) + t f'(0) + \int_0^t (t - s) f''(s)\, ds.$$

By the change of variable $s = t\tau$ we have the form

$$f(t) = f(0) + t f'(0) + t^2 \int_0^1 (1 - \tau) f''(t\tau)\, d\tau.$$

Now consider a function $g : \mathbb{R}^d \to \mathbb{R}$ with continuous second order partial derivatives. For any $x, a \in \mathbb{R}^d$
in the domain of $g$, if we want to expand $g(x + a)$ around $x$, we simply use $u(t) = x + ta$ and use the one
dimensional Taylor series version for the function $f(t) = g(u(t))$. This leads to

$$g(x + a) = g(x) + a^T \nabla g(x) + \int_0^1 (1 - \tau) \left( a^T H_g(x + \tau a)\, a \right) d\tau \qquad (10)$$

where $H_g$ is the Hessian matrix of $g$.
Generalized Minkowski inequality, see [22]:

For a Borel function $g$ on $\mathbb{R}^d \times \mathbb{R}^d$, we have

$$\int \left( \int g(x, y)\, dx \right)^2 dy \le \left[ \int \left( \int g^2(x, y)\, dy \right)^{1/2} dx \right]^2$$



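Definition 1 Let $L > 0$. The Sobolev class $S(2, L)$ is defined as the set of all functions $f : \mathbb{R}^d \to \mathbb{R}$ such
that $f \in W^{2,2}$, and all the second order partial derivatives $\frac{\partial^2 f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}}$, where $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_d)$ is a
multi-index with $|\alpha| = 2$, satisfy

$$\left\| \frac{\partial^2 f}{\partial x_1^{\alpha_1} \cdots \partial x_d^{\alpha_d}} \right\|_2 \le L.$$

Let $H_f(x)$ be the Hessian matrix of $f$ evaluated at $x$. For any $f \in S(2, L)$, using Hölder's inequality it can
be shown that for any $a \in \mathbb{R}^d$, $\int \left( a^T H_f(x)\, a \right)^2 dx \le L^2 (a^T a)^2$. Note that a mixture of Gaussians belongs
to any Sobolev class.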

Given a sample $S = \{X_1, X_2, \dots, X_N\}$, the kernel density estimator $\hat{p}_S(\cdot)$ of the true density $p(\cdot) \in S(2, L)$
is given by

$$\hat{p}_S(x) = \frac{1}{N h^d} \sum_{i=1}^{N} K\left( \frac{x - X_i}{h} \right) \qquad (11)$$

where $K : \mathbb{R}^d \to \mathbb{R}$ is a kernel$^{6,7}$ function satisfying $\int K(x)\, dx = 1$, $\int x K(x)\, dx = 0$ and $\int x^T x K(x)\, dx <
\infty$. In particular assume $\int x^T x K(x)\, dx \le C_1$ for some $C_1 > 0$. Also let $\int K^2(x)\, dx \le C_2$ for some $C_2 >
0$. Since the sample $S$ is random, the quantities $\hat{p}_S(x)$ and $A_S(X_1, X_2, \dots, X_N) = \int [\hat{p}_S(x) - p(x)]^2 dx$,
which is the square of the $L^2$ distance between the estimated density and the true density, are also random. Note
that the expected value of $A_S$, $E(A_S) = E \int [\hat{p}_S(x) - p(x)]^2 dx$, is the mean integrated square error (MISE).
We will show that for sufficiently large sample size, $A_S$ is close to $E(A_S)$ with high probability.

6 Note that normally a kernel is a function of two variables, i.e., $\tilde{K} : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. However, in the nonparametric density
estimation literature a kernel function is defined as $K : \mathbb{R}^d \to \mathbb{R}$, where $\tilde{K}(x, y) = K(x - y)$. To be consistent with the
nonparametric density estimation literature, we will call $K$ our kernel function and denote it by $K$.
7 Note that the kernel $K$ here is different from the one introduced in Section 2.
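The kernel density estimator and the integrated square error $A_S$ described above can be sketched in one dimension as follows. This is an illustration under stated assumptions rather than part of the analysis: it fixes $d = 1$, uses a standard Gaussian kernel (which satisfies the three moment conditions), approximates the integral by a midpoint rule on a finite interval, and all function names are hypothetical.

```python
from math import exp, pi, sqrt


def gaussian_kernel(u):
    """Standard normal density: integrates to 1, zero mean, finite second moment."""
    return exp(-0.5 * u * u) / sqrt(2.0 * pi)


def kde(sample, h, kernel=gaussian_kernel):
    """Return x -> (1 / (N h)) * sum_i K((x - X_i) / h), the estimator with d = 1."""
    n = len(sample)

    def p_hat(x):
        return sum(kernel((x - xi) / h) for xi in sample) / (n * h)

    return p_hat


def integrated_square_error(p_hat, p, lo, hi, steps=2000):
    """Midpoint-rule approximation of the integral of (p_hat - p)^2 over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for j in range(steps):
        x = lo + (j + 0.5) * width
        total += (p_hat(x) - p(x)) ** 2
    return total * width


sample = [-0.8, -0.1, 0.3, 1.2]          # toy data, not drawn from any stated model
p_hat = kde(sample, h=0.5)
print(integrated_square_error(p_hat, gaussian_kernel, -8.0, 8.0))
```

Shrinking `h` lowers the bias but raises the variance of `p_hat`, which is the trade-off quantified by the bias and variance bounds derived next.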

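First fix any $x_0$. The mean square error (MSE) at point $x_0$, $MSE(x_0) = E\left[ (\hat{p}_S(x_0) - p(x_0))^2 \right]$, where
the expectation is taken with respect to the distribution of $S = (X_1, X_2, \dots, X_N)$, can be broken down into
bias and variance parts as follows: $MSE(x_0) = b^2(x_0) + var(x_0)$ where $b(x_0) = E(\hat{p}_S(x_0)) - p(x_0)$ and
$var(x_0) = E\left[ (\hat{p}_S(x_0) - E[\hat{p}_S(x_0)])^2 \right]$.

Let us deal with the bias term first. By introducing the notation $K_H(u) = |H|^{-1/2} K(H^{-1/2} u)$ where
$H = h^2 I$, $I$ is a $d \times d$ identity matrix and $h > 0$ is the kernel bandwidth along all $d$ directions, we can write

$$\hat{p}_S(x) = \frac{1}{N h^d} \sum_{i=1}^{N} K\left( \frac{x - X_i}{h} \right) = \frac{1}{N |H|^{1/2}} \sum_{i=1}^{N} K\left( H^{-1/2}(x - X_i) \right) = \frac{1}{N} \sum_{i=1}^{N} K_H(x - X_i).$$

Now, $E(\hat{p}_S(x_0)) = E K_H(x_0 - X) = \int K_H(x_0 - y) p(y)\, dy = \int K(z) p(x_0 - H^{1/2} z)\, dz$, where the
last equality follows by a change of variables. Expanding $p(x_0 - H^{1/2} z)$ in a Taylor series around $x_0$, using
Equation (10) we obtain

$$p(x_0 - H^{1/2} z) = p(x_0) - (H^{1/2} z)^T \nabla p(x_0) + \int_0^1 (1 - \tau) \left( (H^{1/2} z)^T H_p(x_0 - \tau H^{1/2} z)\, H^{1/2} z \right) d\tau.$$

Thus using $\int K(z)\, dz = 1$ and $\int z K(z)\, dz = 0$ leads to

$$E(\hat{p}_S(x_0)) = p(x_0) + h^2 \int K(z) \left[ \int_0^1 (1 - \tau)\, z^T H_p(x_0 - \tau H^{1/2} z)\, z\, d\tau \right] dz,$$

i.e., $b(x_0) = E(\hat{p}_S(x_0)) - p(x_0) = h^2 \int K(z) \left[ \int_0^1 (1 - \tau)\, z^T H_p(x_0 - \tau H^{1/2} z)\, z\, d\tau \right] dz$.
Now, writing $g(x, z) = \int_0^1 (1 - \tau)\, z^T H_p(x - \tau H^{1/2} z)\, z\, d\tau$,

$$\int b^2(x)\, dx = \int \left( h^2 \int K(z) \left[ \int_0^1 (1 - \tau)\, z^T H_p(x - \tau H^{1/2} z)\, z\, d\tau \right] dz \right)^2 dx = h^4 \int \left( \int K(z) g(x, z)\, dz \right)^2 dx$$

$$\le h^4 \left[ \int \left( \int K^2(z) g^2(x, z)\, dx \right)^{1/2} dz \right]^2 = h^4 \left[ \int K(z) \left( \int \left[ \int_0^1 (1 - \tau)\, z^T H_p(x - \tau H^{1/2} z)\, z\, d\tau \right]^2 dx \right)^{1/2} dz \right]^2$$

$$\le h^4 \left[ \int K(z) \left( \int_0^1 (1 - \tau) \left[ \int \left( z^T H_p(x - \tau H^{1/2} z)\, z \right)^2 dx \right]^{1/2} d\tau \right) dz \right]^2$$

$$\le h^4 \left[ \int K(z) \left( \int_0^1 (1 - \tau)\, L\, z^T z\, d\tau \right) dz \right]^2 = \frac{L^2 h^4}{4} \left( \int K(z)\, z^T z\, dz \right)^2 \le \frac{C_1^2 L^2 h^4}{4}$$

where the first and second inequalities follow by applying the Generalized Minkowski inequality. The third
inequality follows from the fact that $p \in S(2, L)$ and the support of $p$ is the whole real line.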

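Now let us deal with the variance term. Let $\eta_i(x_0) = K\left( \frac{x_0 - X_i}{h} \right) - E\left[ K\left( \frac{x_0 - X_i}{h} \right) \right]$. The random
variables $\eta_i(x_0)$, $i = 1, \dots, N$, are iid with zero mean and variance

$$E[\eta_i^2(x_0)] \le E\left[ K^2\left( \frac{x_0 - X_i}{h} \right) \right] = \int K^2\left( \frac{x_0 - z}{h} \right) p(z)\, dz.$$

Then,

$$var(x_0) = E\left( \hat{p}_S(x_0) - E[\hat{p}_S(x_0)] \right)^2 = E\left[ \left( \frac{1}{N h^d} \sum_{i=1}^{N} \eta_i(x_0) \right)^2 \right] = \frac{1}{N h^{2d}} E[\eta_1^2(x_0)] \le \frac{1}{N h^{2d}} \int K^2\left( \frac{x_0 - z}{h} \right) p(z)\, dz.$$

Clearly,

$$\int var(x)\, dx \le \frac{1}{N h^{2d}} \int \left[ \int K^2\left( \frac{x - z}{h} \right) p(z)\, dz \right] dx = \frac{1}{N h^{2d}} \int p(z) \left[ \int K^2\left( \frac{x - z}{h} \right) dx \right] dz = \frac{1}{N h^d} \int K^2(v)\, dv \le \frac{C_2}{N h^d}.$$
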

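Now,

$$MISE = E(A_S) = E \int [\hat{p}_S(x) - p(x)]^2 dx = \int E [\hat{p}_S(x) - p(x)]^2 dx = \int MSE(x)\, dx$$

$$= \int b^2(x)\, dx + \int var(x)\, dx \le \frac{C_1^2 L^2 h^4}{4} + \frac{C_2}{N h^d}.$$

The bias and variance terms can be balanced by selecting $h^* = \left( \frac{C_2}{C_1^2 L^2} \right)^{\frac{1}{d+4}} \left( \frac{d}{N} \right)^{\frac{1}{d+4}}$. With this choice of $h$
we have $MISE \le \frac{d+4}{d} \left( \frac{C_1^2 L^2}{4} \right)^{\frac{d}{d+4}} \left( \frac{d\, C_2}{4N} \right)^{\frac{4}{d+4}}$. Note that this is of the order $N^{-\frac{4}{d+4}}$. Similar expressions for
bias/variance terms and convergence rate are also known to hold, but with different constants, for asymptotic
MISE approximations (see [24]).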

Since a mixture of Gaussians belongs to any Sobolev class, the following Lemma shows that we can
approximate the density of such a mixture arbitrarily well in the $L^2$ norm sense.

Lemma 17 Let $p \in S(2, L)$ be a $d$-dimensional probability density function and $K : \mathbb{R}^d \to \mathbb{R}$ be any kernel
function with diagonal bandwidth matrix $h^2 I$, satisfying $\int K(x)\, dx = 1$, $\int x K(x)\, dx = 0$, $\int x^T x K(x)\, dx \le
C_1$ and $\int K^2(x)\, dx \le C_2$ for positive $C_1, C_2$. Then for any $\epsilon_0 > 0$ and any $\delta \in (0, 1)$, with probability
greater than $1 - \delta$, the kernel density estimate $\hat{p}_S$ obtained using a sample $S$ of size $\Omega\left( \left( \frac{\log(1/\delta)}{\epsilon_0^2} \right)^d \right)$ satisfies

$\int (p(x) - \hat{p}_S(x))^2 dx \le \epsilon_0$.

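Proof: For a sample $S = \{X_1, X_2, \dots, X_N\}$ we will use the notation $A_S = A_S(X_1, X_2, \dots, X_i, \dots, X_N)$
to denote the random quantity $\int (p(x) - \hat{p}_S(x))^2 dx$. Note that $E(A_S) = MISE$. Our goal is to use a large
enough sample size so that $A_S$ is close to its expectation. In particular we would like to use McDiarmid's
inequality to show that

$$\Pr\left( A_S - E(A_S) > \frac{\epsilon_0}{2} \right) \le \exp\left( -\frac{2 (\epsilon_0/2)^2}{\sum_{i=1}^{N} c_i^2} \right)$$

where $\sup_{x_1, \dots, x_i, \dots, x_N, \hat{x}_i} |A_S(x_1, \dots, x_i, \dots, x_N) - A_S(x_1, \dots, \hat{x}_i, \dots, x_N)| \le c_i$ for $1 \le i \le N$. Let

$$B_i = \int \left( \frac{1}{N h^d} \left[ K\left( \frac{x - X_1}{h} \right) + \dots + K\left( \frac{x - X_i}{h} \right) + \dots + K\left( \frac{x - X_N}{h} \right) \right] - p(x) \right)^2 dx$$

$$\hat{B}_i = \int \left( \frac{1}{N h^d} \left[ K\left( \frac{x - X_1}{h} \right) + \dots + K\left( \frac{x - \hat{X}_i}{h} \right) + \dots + K\left( \frac{x - X_N}{h} \right) \right] - p(x) \right)^2 dx$$
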

Then, 

b, - b, 




K 



x - X,-, 



After integrating, the first term in the above equation can be bounded by ^?j^ d , second term can be 

i(\/C 2 f p 2 (x)dx) .„ 2 „ 

bounded by — ^ '- and the third term can be bounded by Thus, |±Jj — Bi\ < jj^ra + 

N ' Nh d ' 



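Note that the optimal choice of $h$ of the order $\left( \frac{1}{N} \right)^{\frac{1}{d+4}}$ as derived previously does not help to get a tight
concentration inequality type bound. However, we can choose a suitable $h$ that serves our purpose. To this
aim, we assume that $|B_i - \hat{B}_i|$ is dominated by the term $\frac{1}{N h^d}$, i.e.,

$$\frac{1}{N} \le \frac{1}{N h^d} \qquad (12)$$

and later we need to show that this is indeed satisfied for the choice of $h$. Thus,

$$c_i = |B_i - \hat{B}_i| = \sup_{x_1, \dots, x_i, \dots, x_N, \hat{x}_i} |A_S(x_1, \dots, x_i, \dots, x_N) - A_S(x_1, \dots, \hat{x}_i, \dots, x_N)| \le \frac{C}{N h^d}$$

where $C$ is a function of $C_1$, $C_2$ and $L$. Now McDiarmid's inequality yields

$$\Pr\left( A_S - E(A_S) > \frac{\epsilon_0}{2} \right) \le \exp\left( -\frac{\epsilon_0^2 N h^{2d}}{2 C^2} \right) = \exp\left( -\frac{\epsilon_0^2 N^{\beta}}{2 C^2} \right)$$

where we have set $N h^{2d} = N^{\beta}$ for some $\beta > 0$. Setting the right side of the above inequality less than or equal to $\delta$,
we get $\left( \frac{2 C^2 \log(1/\delta)}{\epsilon_0^2} \right)^{1/\beta} \le N$. Now setting $\beta = 1/d$, we get $\left( \frac{2 C^2 \log(1/\delta)}{\epsilon_0^2} \right)^{d} \le N$. For this choice of $\beta$,
solving $N h^{2d} = N^{\beta}$ we get $h = \left( \frac{1}{N} \right)^{\frac{d-1}{2d^2}}$. Now setting this value of $h$ we get $\frac{1}{N h^d} = \frac{1}{N^{(d+1)/(2d)}}$. For $d > 1$
this rate is indeed slower than $\frac{1}{N}$ and hence Equation (12) is satisfied. Next we check what is the convergence
rate of MISE for this choice of $h$. Ignoring the constant terms, the bias term is of the order $h^4 = \left( \frac{1}{N} \right)^{\frac{2(d-1)}{d^2}}$,
whereas the variance term is of the order $\frac{1}{N h^d} = \frac{1}{N^{(d+1)/(2d)}}$. Since the bias term decreases at a much slower rate, the
convergence rate of MISE is dominated by the bias term and hence $MISE \le \frac{C^*}{N^{2(d-1)/d^2}}$ for some constant
$C^*$ independent of $d$ and $N$. Thus to make sure that $MISE = E(A_S) \le \frac{\epsilon_0}{2}$, we need $\left( \frac{2 C^*}{\epsilon_0} \right)^{\frac{d^2}{2(d-1)}} \le N$.
Since $\frac{d^2}{2(d-1)} \le d$ for $d \ge 2$, we have $\left( \frac{2 C^*}{\epsilon_0} \right)^{\frac{d^2}{2(d-1)}} \le \left( \frac{2 C^*}{\epsilon_0} \right)^{d}$, so $N \ge \left( \frac{2 C^*}{\epsilon_0} \right)^{d}$ will suffice. However, the number of examples required
to ensure that with probability greater than $1 - \delta$, $A_S \le E(A_S) + \frac{\epsilon_0}{2}$ is much higher than
$\left( \frac{2 C^*}{\epsilon_0} \right)^{d}$, and hence for any sample of this size, $E(A_S) \le \frac{\epsilon_0}{2}$. The result follows. ■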

For the sake of completeness we present McDiarmid's inequality below.

Lemma 18 Let $X_1, X_2, \dots, X_N$ be iid random variables taking values in a set $A$, and assume that $f :
A^N \to \mathbb{R}$ is a function satisfying

$$\sup_{x_1, x_2, \dots, x_N, \hat{x}_i} |f(x_1, x_2, \dots, x_N) - f(x_1, x_2, \dots, x_{i-1}, \hat{x}_i, x_{i+1}, \dots, x_N)| \le c_i$$

for $1 \le i \le N$. Then for any $\epsilon > 0$,

$$\Pr\{ f(X_1, X_2, \dots, X_N) - E[f(X_1, X_2, \dots, X_N)] \ge \epsilon \} \le \exp\left( -\frac{2 \epsilon^2}{\sum_{i=1}^{N} c_i^2} \right)$$
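As a quick illustration of how the bound in McDiarmid's inequality is used, the sketch below evaluates its right-hand side for given bounded differences and inverts it for the sample mean of variables taking values in an interval of length `value_range`, where each $c_i$ equals `value_range / N`. The function names are hypothetical and the sample-mean setting is only an example, not the statistic $A_S$ analysed above.

```python
from math import ceil, exp, log


def mcdiarmid_bound(eps, c):
    """Right-hand side exp(-2 eps^2 / sum_i c_i^2) of McDiarmid's inequality."""
    return exp(-2.0 * eps ** 2 / sum(ci * ci for ci in c))


def samples_for_mean(eps, delta, value_range=1.0):
    """Smallest N with exp(-2 N eps^2 / range^2) <= delta (sample mean, c_i = range / N)."""
    return ceil(value_range ** 2 * log(1.0 / delta) / (2.0 * eps ** 2))


n = samples_for_mean(eps=0.1, delta=0.05)
print(n, mcdiarmid_bound(0.1, [1.0 / n] * n))
```

The same inversion, applied with $c_i \le C/(N h^d)$, is what produces the sample size requirement in the proof of Lemma 17.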



C Estimation of Unknown Variance 

We now provide the proof of Theorem[8]which combines results from the remainder of this Appendix. 

Proof of Theorem 8: It is shown in Lemma 5A of [16] that the smallest positive root of the determinant
$d(r) = \det(M)(r)$, viewed as a function of $r$, is equal to the variance $\sigma$ of the original mixture $p$, and also that
$d(r)$ undergoes a sign change at its smallest positive root. Let the smallest positive root of $\hat{d}(r) = \det(\hat{M})(r)$
be a. We now show for any e > that a and a are within e given O (j^rg—^ samples. 

In Corollary 20 we show that both $d(r)$ and $\hat{d}(r)$ are polynomials of degree $k(k + 1)$ and that the highest
degree coefficient of $\hat{d}(r)$ is independent of the sample. The rest of the coefficients of $d(r)$ and $\hat{d}(r)$ are
sums of products of the coefficients of individual entries of the matrices $M$ and $\hat{M}$, respectively.



Note that $E(\hat{M}) = M$, i.e., for any $1 \le i, j \le (k + 1)$, $E(\hat{M}_{i,j}(r)) = M_{i,j}(r)$. Since $\hat{M}_{i,j}(r)$ is
a polynomial in $r$, using standard concentration results we can show that the coefficients of the polynomial
$\hat{M}_{i,j}(r)$ are close to the corresponding coefficients of the polynomial $M_{i,j}(r)$ given a large enough sample

size. Specifically, we show in Lemma |231 that given a sample of size O ( nP J$ ' J each of the coefficients of 

each of the polynomials Mi t j(r) can be estimated within error O ( wPQ f y(fc) ) with probability at least 1 — 5. 

Next, in Lemma l24l we show that estimating each of the coefficients of the polynomial Mi j(r) for all 
i,j with accuracy O ( raPO f y(fc) ) ensures that all coefficients of d(f) are O close to the corresponding 
coefficients of d(r) with high probability. 

Consequently, in Lemma 22 we show that when all coefficients of $\hat{d}(r)$ are within $O(\epsilon/k)$ of the
corresponding coefficients of $d(r)$, the smallest positive root $\hat{\sigma}$ of $\hat{d}(r)$ is at most $\epsilon$ away from the smallest
positive root $\sigma$ of $d(r)$.

Observing that there exist many efficient numerical methods for estimating the roots of a polynomial in one
variable to within the desired accuracy completes the proof.
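The last step of the proof can be illustrated with a minimal sketch that locates the smallest positive root at which a polynomial changes sign. Bisection on a scanned grid is used here only as a stand-in for "efficient numerical methods"; the paper does not name a specific method. The function names and the example quadratic are hypothetical.

```python
def poly_eval(coeffs, x):
    """Horner evaluation; coeffs are ordered from the highest degree down to the constant."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc


def smallest_positive_sign_change(coeffs, upper, steps=1000, tol=1e-12):
    """Scan (0, upper] on a uniform grid, bracket the first sign change, then bisect.
    Returns None if no sign change is found on the grid."""
    step = upper / steps
    lo = step
    f_lo = poly_eval(coeffs, lo)
    for i in range(2, steps + 1):
        hi = i * step
        f_hi = poly_eval(coeffs, hi)
        if f_lo != 0.0 and f_lo * f_hi <= 0.0:  # root bracketed in (lo, hi]
            while hi - lo > tol:
                mid = 0.5 * (lo + hi)
                f_mid = poly_eval(coeffs, mid)
                if f_lo * f_mid <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            return 0.5 * (lo + hi)
        lo, f_lo = hi, f_hi
    return None


# Hypothetical example: (x - 0.7)(x - 2.5) = x^2 - 3.2x + 1.75.
root = smallest_positive_sign_change([1.0, -3.2, 1.75], upper=5.0)
# Perturbing one coefficient slightly moves the root only slightly.
root_hat = smallest_positive_sign_change([1.0, -3.2, 1.75 + 1e-6], upper=5.0)
```

The grid resolution is a design choice: two roots closer together than one grid step can be missed, so `steps` should be chosen with the expected root spacing in mind.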



Lemma 19 Consider the $(k + 1) \times (k + 1)$ Hankel matrix $\Gamma$, $\Gamma_{ij} = (\gamma_{i+j}(x, r))$ for $i, j = 0, 1, \ldots, k$, where
$\gamma_n(x, r)$ is the $n$th Hermite polynomial as described above. Then $\det(\Gamma)(x, r)$ is a homogeneous polynomial
of degree $k(k + 1)$ of two variables $x$ and $r$.

Proof: It is easy to see from the definition that the $n$th Hermite polynomial $\gamma_n(x, r)$ is a homogeneous
polynomial of two variables of degree $n$. Thus we can represent the degree of each polynomial term of the
matrix $\Gamma$ as follows

$$\begin{bmatrix}
0 & 1 & 2 & \cdots & k \\
1 & 2 & 3 & \cdots & k + 1 \\
2 & 3 & 4 & \cdots & k + 2 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
k & k + 1 & k + 2 & \cdots & 2k
\end{bmatrix}$$

Now reduce the degree of each element of row $i$ by $(i - 1)$ by factoring it out of the matrix.
The resulting matrix will have degree $(i - 1)$ for all the elements in column $i$, $i = 1, 2, \ldots, (k + 1)$. Then
reduce the degree of each element of column $i$ by $(i - 1)$ by factoring it out of the matrix. The degree of each
element of the resulting matrix is 0, that is, the remaining degree matrix has zeros everywhere. Thus we see that when the
determinant is computed, the degree of each (homogeneous) term is $2 \times (1 + 2 + \cdots + k) = k(k + 1)$. ■
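The degree count in this proof can be checked mechanically: every term of the determinant picks one entry per row and column, so its total degree is $\sum_i (i + \pi(i))$ for some permutation $\pi$. The short sketch below (the function name is hypothetical) enumerates all permutations for small $k$ and confirms that the only total degree that occurs is $k(k + 1)$.

```python
from itertools import permutations


def determinant_term_degrees(k):
    """Set of total degrees over all permutation terms of a (k+1) x (k+1)
    matrix whose (i, j) entry has degree i + j, with i, j = 0, ..., k."""
    idx = range(k + 1)
    return {sum(i + perm[i] for i in idx) for perm in permutations(idx)}


degrees_by_k = {k: determinant_term_degrees(k) for k in range(1, 6)}
```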

We have the following simple corollary. 

Corollary 20 $d(r)$ is a polynomial of degree $k(k + 1)$, with the coefficient of the leading term independent
of the probability distribution $p$. Similarly, $\hat{d}(r)$ is a polynomial of degree $k(k + 1)$, with the leading term
having coefficient independent of the sampled data.

Proof: From Lemma 19, notice that $\det(\Gamma(x, r))$ is a homogeneous polynomial of degree $k(k + 1)$ and hence
the non-zero term $r^{k(k+1)}$ cannot include $x$. Since $M(r)$ is obtained by replacing $x^i$ by $E(x^i)$, the leading
term of $d(r)$ is independent of the probability distribution. Similarly, $\hat{M}(r)$ is obtained by replacing $x^i$ by
$\frac{1}{N}\sum_{j=1}^{N} X_j^i$ and the result follows. ■



Lemma 21 Let $f(x) = x^m + a_{m-1}x^{m-1} + a_{m-2}x^{m-2} + \cdots + a_1 x + a_0$ be a polynomial having a smallest
positive real root $x_0$ with multiplicity one and $f'(x_0) \ne 0$. Let $\hat{f}(x) = x^m + \hat{a}_{m-1}x^{m-1} + \hat{a}_{m-2}x^{m-2} +
\cdots + \hat{a}_1 x + \hat{a}_0$ be another polynomial such that $\|a - \hat{a}\| \le \epsilon$ for some sufficiently small $\epsilon > 0$, where
$a = (a_0, a_1, \ldots, a_{m-1})$ and $\hat{a} = (\hat{a}_0, \hat{a}_1, \ldots, \hat{a}_{m-1})$. Then there exists a $C > 0$ such that the smallest
positive root $\hat{x}_0$ of $\hat{f}(x)$ satisfies $\|\hat{x}_0 - x_0\| \le C\epsilon$.

Proof: Let $a = (a_0, a_1, \ldots, a_{m-1})$ be the coefficient vector. The root of the polynomial can be written as a
function of the coefficients such that $x(a) = x_0$. Thus we have $x^m(a) + a_{m-1}x^{m-1}(a) + a_{m-2}x^{m-2}(a) +
\cdots + a_1 x(a) + a_0 = 0$. Taking the partial derivative with respect to $a_i$ we have,

$$\left[ m x^{m-1}(a) + a_{m-1}(m - 1)x^{m-2}(a) + a_{m-2}(m - 2)x^{m-3}(a) + \cdots + 2a_2 x(a) + a_1 \right] \frac{\partial x}{\partial a_i}(a) + x^i(a) = 0,$$

so that $\frac{\partial x}{\partial a_i}(a) = -x^i(a) / f'(x(a))$ and we can write

$$\|\nabla x(a)\| \le \frac{\sum_{i=0}^{m-1} |x^i(a)|}{|f'(x(a))|}.$$

Note that $|f'(x)|$ at the root $x = x_0$ is lower bounded by some $c_1 > 0$. Since $f''(x)$ is also a polynomial,
$|f''(x)|$ can be upper bounded by another $c_2 > 0$ within a small neighborhood of $x_0$ and hence $|f'(x)|$ can
be lower bounded by some $c_3 > 0$ within the small neighborhood around $x_0$. This neighborhood can also be
specified by all $\xi$ within a ball $B(a, \epsilon)$ of radius $\epsilon > 0$, sufficiently small, around $a$. For sufficiently small $\epsilon$,
the polynomial $\sum_{i=0}^{m-1} x^i(\xi)$, where $\xi \in B(a, \epsilon)$, must be upper bounded by some $c_4 > 0$. Thus there exists
some constant $C > 0$ such that $\sup_{\xi \in B(a, \epsilon)} \|\nabla x(\xi)\| \le C$.

Now applying the mean value theorem,

$$|x(a) - x(\hat{a})| \le \|a - \hat{a}\| \sup_{\xi \in B(a, \epsilon)} \|\nabla x(\xi)\| \le C\epsilon.$$
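The sensitivity formula behind this proof, $\partial x / \partial a_i = -x^i / f'(x)$, follows from implicitly differentiating $f(x(a)) = 0$. As a small check, the sketch below compares it with central finite differences on a monic quadratic, where the smallest root has a closed form. The quadratic, the step size `H`, and the function names are illustrative assumptions.

```python
import math


def smallest_root(a1, a0):
    """Smaller real root of the monic quadratic x^2 + a1*x + a0 (assumes real roots)."""
    return (-a1 - math.sqrt(a1 * a1 - 4.0 * a0)) / 2.0


def root_gradient(a1, a0):
    """(dx/da0, dx/da1) from the implicit-differentiation formula dx/da_i = -x^i / f'(x)."""
    x = smallest_root(a1, a0)
    f_prime = 2.0 * x + a1
    return (-1.0 / f_prime, -x / f_prime)


# Hypothetical example: x^2 - 3.2x + 1.75 has smallest root 0.7 and f'(0.7) = -1.8.
A1, A0, H = -3.2, 1.75, 1e-6
g0, g1 = root_gradient(A1, A0)
fd0 = (smallest_root(A1, A0 + H) - smallest_root(A1, A0 - H)) / (2.0 * H)
fd1 = (smallest_root(A1 + H, A0) - smallest_root(A1 - H, A0)) / (2.0 * H)
```

Because $|f'(x_0)|$ appears in the denominator, the root becomes more sensitive to coefficient errors as the two roots of the quadratic approach each other, which is why the lemma requires a simple root.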



Lemma 22 Let $\sigma$ be the smallest positive root of $d(r)$. Suppose $\hat{d}(r)$ is the polynomial where each of the
coefficients of $d(r)$ is estimated within $\epsilon$ error for some sufficiently small $\epsilon > 0$. Let $\hat{\sigma}$ be the smallest
positive root of $\hat{d}(r)$. Then $|\hat{\sigma} - \sigma| = O(k\epsilon)$.

Proof: We have shown in Corollary 20 that $d(r)$ is a polynomial of degree $k(k + 1)$ and the leading term
has some constant coefficient. Consider a fixed set of $k$ means. This fixed set of means gives rise
to polynomials $d(r)$ and $\hat{d}(r)$, whose coefficients are determined by the means. The two coefficient vectors
differ in at most $k(k + 1)$ entries, each by at most $\epsilon$, so their Euclidean distance is at most
$\sqrt{k(k + 1)}\,\epsilon = O(k\epsilon)$. Hence, according to Lemma 21, there exists a $C > 0$ such that $|\hat{\sigma} - \sigma| \le Ck\epsilon$. Since
all possible sets of $k$ means form a compact subset, there exists a finite maximum of all the $C$s. Let this
maximum be $C^*$. This proves that $|\hat{\sigma} - \sigma| = O(k\epsilon)$. ■



C.1 Properties of the entries of matrix M

From the construction of the matrix $M$ it is clear that it has $2k$ different entries. Each such entry is a
polynomial in $r$. Let us denote these distinct entries by $m_i(r) = E[\gamma_i(x, r)]$, $i = 1, 2, \ldots, 2k$. Due to the
recurrence relation of the Hermite polynomials we observe the following properties of the $m_i(r)$s,

1. If $i$ is even then the maximum degree of the polynomial $m_i(r)$ is $i$, and if $i$ is odd then the maximum degree is $(i - 1)$.

2. For any $m_i(r)$, each term of $m_i(r)$ has an even degree of $r$. Thus each $m_i(r)$ can have at most $i$ terms.

3. The coefficient of each term of $m_i(r)$ is the product of a constant and an expectation. The constant
can be at most $(2k)!$ and the expectation can be, in the worst case, of the quantity $X^{2k}$, where $X$ is
sampled from $p$.

Note that the empirical version of the matrix $M$ is $\hat{M}$, where each entry $m_i(r)$ is replaced by its empirical
counterpart $\hat{m}_i(r)$. Using a standard concentration inequality we show that for any $m_i(r)$, its coefficients are
arbitrarily close to the corresponding coefficients of $\hat{m}_i(r)$ provided a large enough sample size is used to
estimate $\hat{m}_i(r)$.

Lemma 23 For any $m_i(r)$, $i = 1, 2, \ldots, 2k$, let $\beta$ be any arbitrary coefficient of the polynomial $m_i(r)$.
Suppose $X_1, X_2, \ldots, X_N$ iid samples from $p$ are used to estimate $m_i(r)$ and the corresponding coefficient is
$\hat{\beta}$. Then there exists a polynomial $\eta_1(k)$ such that for any $\epsilon > 0$ and $0 < \delta$, $|\beta - \hat{\beta}| \le \epsilon$ with probability at
least 1 — S, provided N > - . 



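Proof: Note that in the worst case $\beta$ may be the product of a constant, which can be at most $(2k)!$, and
the quantity $E(X^{2k})$. First note that $E\left(\frac{1}{N}\sum_{i=1}^{N} X_i^{2k}\right) = E(X^{2k})$. Now,

$$\mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^{N} X_i^{2k}\right) = \frac{\mathrm{Var}(X^{2k})}{N} = \frac{1}{N} E\left(X^{2k} - E(X^{2k})\right)^2 = \frac{1}{N}\left(E(X^{4k}) - (E(X^{2k}))^2\right) \le \frac{E(X^{4k})}{N} \le \frac{(16nk^2)^{2k}}{N}.$$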

The last inequality requires a few technical observations. First note that once the Gaussian mixture is
projected from $\mathbb{R}^n$ to $\mathbb{R}$, the mean of each component Gaussian lies within the interval $[-\sqrt{n}, \sqrt{n}]$. Next note
that for any $X \sim \mathcal{N}(\mu, \sigma^2)$, the expectation of the quantity $X^i$ for any $i$ can be given by the recurrence relation
$E(X^i) = \mu E(X^{i-1}) + (i - 1)\sigma^2 E(X^{i-2})$. From this recurrence relation we see that $E(X^{4k})$ is a homogeneous
polynomial of degree $4k$ in $\mu$ and $\sigma$. Since $|\mu| \le \sqrt{n}$ and assuming $\sigma \le \sqrt{n}$, each term of this homogeneous
polynomial is at most $(\sqrt{n})^{4k} = n^{2k}$. Next we argue that the homogeneous polynomial $E(X^{4k})$ can have at
most $(4k)!$ terms. To see this let $x_i$ be the sum of the coefficients of the terms appearing in the homogeneous
polynomial representing the expectation of $X^i$. Note that $x_0 = x_1 = 1$, and for $i \ge 2$, $x_i = x_{i-1} + (i - 1)x_{i-2}$.
Using this recurrence relation, we have $x_{4p} = x_{4p-1} + (4p - 1)x_{4p-2} \le x_{4p-1} + (4p - 1)x_{4p-1} = 4p\,x_{4p-1} \le
4p(4p - 1)x_{4p-2} \le 4p(4p - 1)(4p - 2)x_{4p-3} \le \cdots \le 4p(4p - 1)(4p - 2)(4p - 3) \cdots (3)(2)(1) = (4p)!$.
Thus the homogeneous polynomial representing the expectation of $X^{4k}$ has at most $(4k)!$ terms and each term is
at most $n^{2k}$. This ensures that $E(X^{4k}) \le (4k)!\,n^{2k} \le (4k)^{4k} n^{2k} = (16nk^2)^{2k}$. Note that this upper bound
also holds when $X$ is sampled from a mixture of $k$ univariate Gaussians.
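The two recurrences used in this argument are easy to check numerically. The sketch below, with hypothetical function names, computes the Gaussian moments from $E(X^i) = \mu E(X^{i-1}) + (i - 1)\sigma^2 E(X^{i-2})$ and the coefficient sums from $x_i = x_{i-1} + (i - 1)x_{i-2}$, so that the claims $x_i \le i!$ and $E(X^{4k}) \le (16nk^2)^{2k}$ can be inspected for small cases.

```python
import math


def gaussian_moments(mu, sigma, order):
    """E(X^i) for i = 0..order with X ~ N(mu, sigma^2), using the recurrence
    E(X^i) = mu * E(X^{i-1}) + (i - 1) * sigma^2 * E(X^{i-2})."""
    m = [1.0, mu]
    for i in range(2, order + 1):
        m.append(mu * m[i - 1] + (i - 1) * sigma ** 2 * m[i - 2])
    return m[: order + 1]


def coefficient_sums(order):
    """x_i for i = 0..order with x_0 = x_1 = 1 and x_i = x_{i-1} + (i - 1) * x_{i-2}."""
    x = [1, 1]
    for i in range(2, order + 1):
        x.append(x[i - 1] + (i - 1) * x[i - 2])
    return x[: order + 1]


moments = gaussian_moments(2.0, 3.0, 4)   # E(X^4) = mu^4 + 6 mu^2 sigma^2 + 3 sigma^4
sums = coefficient_sums(12)

# Worst case for n = 2, k = 1: mu = sigma = sqrt(n); compare with (16 n k^2)^(2k).
n, k = 2, 1
worst = gaussian_moments(math.sqrt(n), math.sqrt(n), 4 * k)[4 * k]
```

Setting $\mu = \sigma = 1$ in the moment recurrence reproduces the coefficient sums exactly, which is the observation the counting argument relies on.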

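Now applying Chebyshev's inequality, we get

$$\Pr\left( \left| \frac{1}{N}\sum_{i=1}^{N} X_i^{2k} - E(X^{2k}) \right| > \frac{\epsilon}{(2k)!} \right) \le \frac{((2k)!)^2\, \mathrm{Var}\left(\frac{1}{N}\sum_{i=1}^{N} X_i^{2k}\right)}{\epsilon^2} \le \frac{(2k)^{4k}(16nk^2)^{2k}}{N\epsilon^2} = \frac{(64nk^4)^{2k}}{N\epsilon^2}.$$

Noting that the constant term in $\beta$ can be at most $(2k)!$, upper bounding the last quantity above by $\delta$,
and applying the union bound ensures the existence of a polynomial $\eta_1(k)$ and yields the desired result.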
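To make the last step concrete, the following sketch solves the Chebyshev bound for $N$. It assumes the final bound has the form $(64nk^4)^{2k} / (N\epsilon^2)$, as the arithmetic $(2k)^{4k}(16nk^2)^{2k} = (64nk^4)^{2k}$ suggests, and it ignores the union bound over coefficients. The function names are hypothetical, and the resulting $N$ is a sufficient size under these assumptions rather than the exact statement of Lemma 23.

```python
import math


def chebyshev_tail_bound(n, k, N, eps):
    """Assumed form of the final Chebyshev bound: (64 n k^4)^(2k) / (N eps^2)."""
    return (64 * n * k ** 4) ** (2 * k) / (N * eps ** 2)


def sufficient_sample_size(n, k, eps, delta):
    """Smallest integer N that makes the assumed bound at most delta
    (up to floating-point rounding)."""
    return math.ceil((64 * n * k ** 4) ** (2 * k) / (eps ** 2 * delta))


N_small = sufficient_sample_size(1, 1, 1.0, 1.0)
N_example = sufficient_sample_size(4, 2, 0.1, 0.05)
tail = chebyshev_tail_bound(4, 2, N_example, 0.1)
```

The rapid growth of the result with $k$ reflects the $n^{O(k)}$ dependence of the bound: it is polynomial in $n$, $1/\epsilon$ and $1/\delta$ only for a fixed number of components.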

C.2 Concentration of coefficients of d(r)

In this section we show that if the coefficients of the individual entries of the matrix M (recall each such entry 
is a polynomial of r) are estimated arbitrarily well then the coefficients of d(r) are also estimated arbitrarily 
well. 

Lemma 24 There exists a polynomial $\eta_2(k)$ such that if coefficients of each of the entries of matrix $M$ (where
each such entry is a polynomial of t) are estimated within error then each of the coefficients of d(r) 

are estimated within e error. 

Proof: First note that $M$ is a $(k + 1) \times (k + 1)$ matrix. While computing the determinant, each entry of
the matrix $M$ is multiplied by $k$ different entries of the matrix. Further, each entry of the matrix (which is
a polynomial in $r$) can have at most $2k$ terms. Thus in the determinant $d(r)$, each of the coefficients of
$r^{2i}$, $i = 1, 2, \ldots, \frac{k(k+1)}{2}$, has only $\eta_1(k)$ terms for some polynomial $\eta_1(k)$. Consider any one of the $\eta_1(k)$
terms and let us denote it by $b$. Note that $b$ is a product of at most $k$ coefficients of the entries of $M$.
Without loss of generality let us denote $b = \beta_1\beta_2 \cdots \beta_l$, where $l$ can be at most $k$. Let $\hat{b}$ be the estimate of
$b$ given by $\hat{b} = \hat{\beta}_1\hat{\beta}_2 \cdots \hat{\beta}_l$ such that for any $1 \le i \le l$, $|\beta_i - \hat{\beta}_i| \le \epsilon^*$ for some $\epsilon^* > 0$. For convenience we
will write $\hat{\beta}_i = \beta_i + \epsilon^*$. Then, for $\epsilon^* \le 1$, we can write

$$|b - \hat{b}| = |\beta_1\beta_2 \cdots \beta_l - (\beta_1 + \epsilon^*)(\beta_2 + \epsilon^*) \cdots (\beta_l + \epsilon^*)|$$
$$\le \left( a_1\epsilon^* + a_2\epsilon^{*2} + \cdots + a_{l-1}\epsilon^{*(l-1)} + \epsilon^{*l} \right)$$
$$\le (a_1 + a_2 + \cdots + a_{l-1} + 1)\,\epsilon^*$$

where $a_i$ is a summation of $\eta_3(k)$ terms for some polynomial $\eta_3(k)$ and each term is a product of at
most $(l - 1)$ $\beta_j$s. Note that each $\beta_j$ can have value at most $(2k)!(2k)!(\sqrt{n})^{2k} \le (2k)^{4k} n^k = (16nk^4)^k$.



Thus $(a_1 + a_2 + \cdots + a_{l-1} + 1) \le k\eta_3(k)(16nk^4)^k$. Clearly $|b - \hat{b}| \le k\eta_3(k)(16nk^4)^k \epsilon^*$. Thus there exists

some polynomial 772 such that if we set = , then the coefficients of of d(r) are estimated within error 

fcr ?3 (fc)(16nfc 4 ) fc ^ 7 < e. ■ 
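The way a per-coefficient error $\epsilon^*$ propagates through a product of $l$ coefficients can be illustrated numerically. The sketch below does not use the paper's constants. It instead checks the elementary bound $|\prod_i (\beta_i + e_i) - \prod_i \beta_i| \le (B + \epsilon^*)^l - B^l$, valid when $|\beta_i| \le B$ and $|e_i| \le \epsilon^*$, which is likewise linear in $\epsilon^*$ to first order. The values of `B`, `EPS_STAR`, `L` and the function names are assumptions made for illustration.

```python
import random


def product(values):
    """Product of a sequence of floats."""
    out = 1.0
    for v in values:
        out *= v
    return out


def product_error_bound(B, eps_star, l):
    """Elementary bound (B + eps*)^l - B^l on the error of a product of l
    factors, each bounded by B in magnitude and perturbed by at most eps*."""
    return (B + eps_star) ** l - B ** l


rng = random.Random(0)
B, EPS_STAR, L = 2.0, 1e-3, 4
betas = [rng.uniform(-B, B) for _ in range(L)]
beta_hats = [b + rng.uniform(-EPS_STAR, EPS_STAR) for b in betas]

err = abs(product(betas) - product(beta_hats))
bound = product_error_bound(B, EPS_STAR, L)
linear_bound = L * (B + EPS_STAR) ** (L - 1) * EPS_STAR  # mean value theorem
```

The linearized bound makes the dependence explicit: the error in the product is at most a factor of order $l B^{l-1}$ times $\epsilon^*$, so choosing $\epsilon^*$ small enough relative to that factor controls the error of each coefficient of the determinant.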



